COMMENT PAGE FOR:
HTML Avoid UUID Version 4 Primary Keys in Postgres
tom_m wrote 10 hours 27 min ago:
It depends. What a narrow-minded article, sigh. Bring on the vibe
coding! Let's just be mindless.
marcus_holmes wrote 11 hours 11 min ago:
I didn't see my primary use case for UUIDs covered: sharing
identifiers across entities is dangerous.
I wrote a CRUD app for document storage. It had user IDs and document
IDs. I wrote a method GetDocumentForUser(docID, userID) that checked
permissions for that user and document and returned the document if
permitted. I then, stupidly, called that method with
GetDocumentForUser(userID, docID), and it took me a good half hour to
work out why this never returned anything.
It never returned anything because a valid userID will never be a valid
docID. If I had used integers it would have returned documents, and I
probably wouldn't have spotted it while testing, and I would have
shipped a change that cheerfully handed people other people's
documents.
I will put up with a fairly considerable amount of performance hit to
avoid having this footgun lurking. And yes, I know there are other ways
around this (e.g. types) but those come with their own trade-offs too.
Jeff_Brown wrote 12 hours 51 min ago:
If you're really ambitious you'll use two UUIDs for the ID, because for
an app in which at least a billion people have at least 327 million
random v4 UUIDs, the probability of a collision will be greater than
1%.
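For reference, the back-of-envelope birthday bound behind that figure
(n is the hypothetical id count above, N the number of possible random
v4 values):
\[ p \approx 1 - e^{-n^2/(2N)}, \quad n = 10^9 \cdot 3.27\times10^8 \approx 3.3\times10^{17}, \quad N = 2^{122} \approx 5.3\times10^{36} \]
\[ n^2/(2N) \approx 1.07\times10^{35} / 1.06\times10^{37} \approx 0.01 \;\Rightarrow\; p \approx 1\% \]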
deepsun wrote 13 hours 24 min ago:
I still don't understand why people don't remove the hyphens from
UUIDs. Hyphens make it harder to copy-paste IDs. The only reason to
keep them is to make it explicit "hey, this is a UUID"; otherwise it's
a completely internal affair.
Even worse, some tools generate random strings and then ADD hyphens to
them to look like a UUID (even though it's not one, as the UUID version
byte is filled randomly as well). I cannot wrap my head around why, e.g.:
HTML [1]: https://github.com/VictoriaMetrics/VictoriaLogs/blob/v1.41.0/a...
dspillett wrote 15 hours 42 min ago:
> Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities
It is not just about being hard to guess a valid individual identifier
in a vacuum. Random (or at least random-ish) values, be they UUIDs or
undecorated integers, in this context are also about it being hard to
guess one from another, or a selection of others.
Wrt: "it isn't x it is y" form: I'm not an LLM, 'onest guv!
moomoo11 wrote 15 hours 58 min ago:
I'll do it regardless because any time I've tried to chase
optimization early like this, hardware has always evolved faster.
We have all faced issues where we don't know where the data will
ultimately live that's optimal for our access patterns.
Or we have devices and services doing async operations that need to
sync.
I'm not working on mission critical "if this fails there's a
catastrophic event" type shit. It's just rent-seeking SaaS type
shit.
Oh no, it cost $0.35 extra to make $100. Next year we'll make more money
relative to the cost increase.
dizlexic wrote 18 hours 4 min ago:
Meh, uuid go brrrrrrrrrrrrrrrrrrrrrrrrr
Xorakios wrote 18 hours 16 min ago:
I was born in the Vancouver Social Security office #538 area in 1962.
Guess how much I have to defend against attackers trying 538-62-xxxx
mcsoft wrote 18 hours 17 min ago:
Snowflake [1] or Sonyflake [2] ids work:
HTML [1]: https://en.wikipedia.org/wiki/Snowflake_ID
HTML [2]: https://github.com/sony/sonyflake?tab=readme-ov-file
burnt-resistor wrote 18 hours 31 min ago:
UUID PKs are trying to solve the wrong problem. Integer/serial primary
keys are not the problem so long as they're never exposed or usable
externally. A critical failure of nearly every RESTful framework is
exposing internal database identifiers rather than encrypted ones,
which preserve relative performance and creation order while
eliminating probing for unowned keys.
p0w3n3d wrote 19 hours 48 min ago:
What about the newest PostgreSQL support for UUIDv7? Has anybody done
tests? This is what we're heading towards at the moment of writing, so
I'd like to ask before we'd eventually have to roll back the decision.
AtNightWeCode wrote 19 hours 57 min ago:
The counterargument I would make is that having all these integer ids
comes with many problems. You can't make em public cause they leak
info. They are not unique across environments. Meaning you have to spin
up a lot of bs envs to just run it. But retros are for complaining
about test envs, right?
That UUIDv4 only has 122 random bits is a bs argument. Such a made-up problem.
But a fair point is that one should use a sequential uuid to avoid
fragmentation. One that has a time part.
kgeist wrote 19 hours 31 min ago:
Some additional cases we encounter quite often where UUIDs help:
- A client used to run our app on-premises and now wants to migrate
to the cloud.
- Support engineers want to clone a client's account into the dev
environment to debug issues without corrupting client data.
- A client wants to migrate their account to a different region (from
US to EU).
Merging data using UUIDs is very easy because ID collisions are
practically impossible. With integer IDs, we'd need complex and
error-prone ID-rewriting scripts. UUIDs are extremely useful even
when the tables are small, contrary to what the article suggests.
andatki wrote 14 hours 53 min ago:
If merging or moving data between environments is a regular
occurrence, I agree it would be best to have non-colliding primary
keys. I have done an environment move (new DB in a different AWS
region) with integers and sequences for maybe a 100-table DB and
it's doable but a high-cost task. At that company we also had
the demo/customer preview environment concept where we needed to
keep the data but move it.
m000 wrote 20 hours 14 min ago:
Why not just use UUIDs as a unique column next to a bigint PK?
The power and main purpose of UUIDs is to act as easy to produce,
non-conflicting references in distributed settings. Since the scope of
TFA is explicitly set to be "monolithic web apps", nothing stops you
from having everything work with bigint PKs internally, and just add
the UUIDs where you need to provide external references to
rows/objects.
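A minimal sketch of that layout (the table and column names here are
illustrative, not from TFA; gen_random_uuid() could be any generator):
    CREATE TABLE documents (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        external_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,
        title       text NOT NULL
    );
    -- FKs and joins stay on the compact bigint; only external_id
    -- ever leaves the system
    SELECT id FROM documents WHERE external_id = $1;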
mrkeen wrote 16 hours 47 min ago:
Yes, if you're in the group of developers who are passionate about db
performance, but have ruled out the idea of spreading work out to
multiple DBs, then continuing to use sequential IDs is fine.
kortex wrote 20 hours 28 min ago:
I've been using ULIDs [0] in prod for many years now, and I love them.
I just use string encoding, though if I really wanted to squeeze out
every last MB, I could do some conversion so it is stored as 16 bytes
instead of 26 chars. In practice it's never mattered, and the
simplicity of just string IDs everywhere is nice.
Sometimes I have to talk to legacy systems, all my APIs have str IDs,
and I encode int IDs as just decimal left padded with leading zeros up
to 26 chars. Technically not a compliant ULID but practically speaking,
if I see leading `00` I know it's not an actual ULID, since that would
be before Nov-2004, and ULID was invented in 2017. The ORM
automatically strips the zeros and the query just works.
I'm just kind of over using sequential int IDs for anything bigger than
hobby level stuff. Testing/fixturing/QA are just so much easier when
you do not have to care about whether an ID happens to already exist.
[0]
HTML [1]: https://github.com/ulid/spec
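(For what it's worth, that zero-padding shim is a one-liner in SQL as
well; 26 is the length of a ULID's text form, and 42 stands in for a
legacy int ID:)
    SELECT lpad(42::text, 26, '0');  -- 24 leading zeros, then '42'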
sedatk wrote 20 hours 24 min ago:
See
HTML [1]: https://news.ycombinator.com/item?id=46211578
kortex wrote 15 hours 49 min ago:
The Python implementation I use doesn't implement this quirk. It's just
timestamp + randomness in Crockford Base32. That's all I need. Sure,
it doesn't fully "comply with the spec", but frankly the
sub-millisecond sequence quirk was a complete mistake.
impoppy wrote 20 hours 30 min ago:
This is such a mediocre article. It provides plenty of valid reasons to
consider avoiding UUIDs in databases; however, it doesn't say what
should be used if one wants primary keys that are not easy to
predict. The XOR alternative is too primitive. And, well, while I get
why I should consider avoiding UUIDs, what should I use instead?
p1necone wrote 20 hours 40 min ago:
Being able to create something and know the id of it before waiting for
an http round trip simplifies enough code that I think UUIDs are worth
it for me. I hadn't really considered the potential perf optimization
from orderable ids before though - I will consider UUID v7 in future.
andatki wrote 15 hours 1 min ago:
Great!
miiiiiike wrote 20 hours 45 min ago:
The article is muddled; I wish he'd split it into two: one for UUID4
and another for UUID7.
I was using 64-bit snowflake pks
(timestamp+sequence+random+datacenter+node) previously and made the
switch to UUID7 for sortable, user-facing pks. I'm more than fine
letting the DB handle a 128-bit int instead of a 64-bit int if it means
not having to make sure that the latest version of my snowflake function
has made it to every db or that my snowflake server never hiccups,
ever.
Most of the data that's going to be keyed with a uuid7 is getting
served straight out of Redis anyway.
maxnoe wrote 20 hours 59 min ago:
Would this argument still apply if I need to store a uuidv4 anyway in
the table?
And I'd likely want a unique constraint on that?
phendrenad2 wrote 21 hours 2 min ago:
You probably don't want integer primary keys, and you probably don't
want UUID primary keys. You probably want something in-between,
depending on your use case. UUID is one extreme on this spectrum, which
tries to solve all of the problems, including ones you might not have.
pil0u wrote 20 hours 45 min ago:
What's in-between? I posted the article because I'm in the middle of
that choice and wanted to generate discussion/contradiction.
So far, people have talked a lot about UUIDs, so I'm genuinely
curious about what's in-between.
phendrenad2 wrote 12 hours 45 min ago:
An example would be YouTube's video IDs. They're custom-fit for a
purpose (security: no; avoiding the problem where people fish for
auspicious YouTube video numbers or something: yes).
Another example would be a function that sorts the numbers 0
through 999 in a seemingly random order (but is actually
deterministic), and then repeats that for each block of 1000 with a
slight shift; a sketch follows below. That discourages casual numeric
iteration but isn't as complex or cryptographically secure as a UUID.
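A toy version of that shuffle in SQL (387 is an arbitrary multiplier,
coprime to 1000, so the mapping is a true permutation):
    -- maps each id in 0..999 to a fixed, scrambled-looking position
    SELECT id, (id * 387) % 1000 AS public_n
    FROM generate_series(0, 999) AS id;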
can3p wrote 21 hours 18 min ago:
> For many business apps, they will never reach 2 billion unique values
per table, so this will be adequate for their entire life. I've also
recommended always using bigint/int8 in other contexts.
I'm sure every dba has a war story that starts with a similar decision
in the past.
Waterluvian wrote 21 hours 44 min ago:
Any decent resources with benchmark data on Postgres insertion,
indexing, retrieval, etc. for UUID vs. integer-based PKs?
whateveracct wrote 21 hours 52 min ago:
Sometimes it's nice for your PK to be uniformly distributed. As a
reader, even if it hurts as a writer. For instance, you can easily
shard queries and workloads.
> the impact to inserts and retrieval of individual items or ranges of
values from the index.
Classic OLTP vs OLAP.
iainctduncan wrote 22 hours 16 min ago:
Counterargument... I do technical diligence so I talk to a lot of
companies at points of inflection, and I also talk to lots who are
stuck.
The ability to rapidly shard everything can be extremely valuable. The
difference between "we can shard on a dime" and "sharding will take a
bunch of careful work" can be expensive. If the company has poor
margins, this can be the difference between "can scale easily" and
"we're not getting this investment".
I would argue that if your folks have the technical chops to be able to
shard while avoiding surrogate guaranteed unique keys, great. But if
they don't.... a UUID on every table can be a massive get-out-of-jail
free card, and for many companies this is much, much more important than some
minor space and time optimizations on the DB.
Worth thinking about.
mrinterweb wrote 16 hours 51 min ago:
I can see how sharding could be difficult with a bigint FK, but
UUIDv7 would still play nice, if I understand your point correctly.
The point of the article is that monotonically increasing foreign keys
have performance benefits over random UUIDv4 FKs in PostgreSQL.
jessep wrote 20 hours 42 min ago:
Sort of related, but we had to shard as usage grew and didn't have
uuids and it was annoying. Wasn't the most annoying bit though.
Whole thing is pretty complex regardless of uuid, if you have a
highly interconnected data model that needs to stay online while
migrating.
iainctduncan wrote 20 hours 40 min ago:
Right, but if you start off with uuids and the expectation that you
might use them to shard, you'll wind up factoring that into the
data model. Retrofitting, as you rightly say, can be much harder.
FancyFane wrote 22 hours 24 min ago:
MySQL benefits from these changes as well. What we're really
discussing is random primary key inserts (UUIDv4) vs incrementing
primary key inserts (UUIDv6 or v7).
PlanetScale wrote up a really good article on why incrementing primary
keys are better for b-tree performance than randomly inserted
primary keys.
HTML [1]: https://planetscale.com/blog/btrees-and-database-indexes
sneak wrote 23 hours 27 min ago:
This is why ULID exists and why I use them in my ext_id columns. For
the actual relational IDs internal to the db I use smaller/faster data
types.
henning wrote 23 hours 31 min ago:
UUIDs make enumeration attacks harder and also prevent situations where
seeing a high valid ID value lets you estimate how much money a private
company is earning if they charge based on the object the ID is
associated with. If you can sample enough object ID values and see when
the IDs were created, you could reverse engineer their ARR chart and
see whether they're growing or not which many companies want to avoid.
hippo22 wrote 23 hours 36 min ago:
The author should include benchmarks; otherwise, saying that UUIDs
"increase latency" is meaningless. For instance, how much longer
does it take to insert a UUID vs. an integer? How much longer does
scanning an index take?
jakeydus wrote 23 hours 19 min ago:
The author doesn't reference any tests that they themselves ran, but
they did link a cybertec article [0] with some benchmarks.
[0]
HTML [1]: https://www.cybertec-postgresql.com/en/unexpected-downsides-...
deathanatos wrote 23 hours 51 min ago:
> Are UUIDs secure?
> Misconceptions: UUIDs are secure
> One misconception about UUIDs is that they're secure. However, the
RFC describes that they shouldn't be considered secure
"capabilities."
> From RFC 4122 Section 6 Security Considerations:
> Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities
This is just wrong, and the citation doesn't support it. You're not
guessing a 122-bit long random identifier. What's crazy is that the
article, immediately prior to this, even cites the very math involved
in showing exactly how unguessable that is.
⦠the linked citation (to §4.4, which is different from the in-prose
citation) is just about how to generate a v4, and completely unrelated
to the claim. The prose citation to §6 is about UUIDs generally: the
statement "Do not assume that [all] UUIDs are hard to guess" is not
logically inconsistent with properly-generated UUIDv4s being hard to
guess. A subset of UUIDs have security properties, if the system
generating & using them implements those properties, but we should not
assume all UUIDs have that property.
Moreover, replacing an unguessable UUID with an (effectively random)
32-bit integer does make it guessable, and the scheme laid out seems
completely insecure if it is to be used in the contexts where one finds
UUIDv4s serving as unguessable identifiers.
The additional size argument is pretty weak too; at "millions of rows",
a UUID column is consuming an additional ~24 MiB.
stickfigure wrote 1 day ago:
This is incredibly database-specific. In Postgres random PKs are bad.
But in distributed databases like Cockroach, Google Cloud Datastore,
and Spanner it is the opposite - monotonic PKs are bad. You want to
distribute load across the keyspace so you avoid hot shards.
nonethewiser wrote 13 hours 29 min ago:
>In Postgres random PKs are bad
Why?
whizzter wrote 15 hours 27 min ago:
Even in a distributed database you want increasing (even if not
monotonic) keys since the underlying b-tree or whatever will very
likely behave badly for entirely random data.
UUIDv7 is very useful for these scenarios since
A: A hash or modulus of the key will be practically random due to the
lower bits being random or pseudo-random (i.e. it distributes well
between nodes)
B: the first bits are sortable, so the underlying storage on each
node won't go bananas.
dspillett wrote 15 hours 38 min ago:
I wouldn't say it is incredibly database specific; it is more
database-type specific. For most general, non-sharded, databases,
random key values can be a problem as they lead to excess
fragmentation in b-trees and similar structures.
PunchyHamster wrote 16 hours 32 min ago:
UUIDv7 is not monotonic tho
HendrikHensen wrote 19 hours 46 min ago:
In Google Cloud Bigtable we had the issue that our domain's primary
key was a sequential integer autogenerated by another app. So we just
reversed it, and it distributed automatically quite nicely.
bruckie wrote 19 hours 41 min ago:
Doing this (reversing the bits of sequential ids) is common
guidance inside Google for its distributed datastores like Spanner
and Bigtable.
hedora wrote 19 hours 50 min ago:
It's also application specific. If you have a workload that's write
heavy, has temporal skew, and is highly concurrent, but rarely creates
new records, you're probably better off with a random PK, even in PG.
kortex wrote 20 hours 45 min ago:
As long as the key has sufficient entropy (i.e. not monotonic
sequential ints), that ensures the keyspace is evenly distributed,
correct? So UUID>=v4, ULID, KSUID, possibly snowflake, should be fine
for the sake of even distribution of the hashes.
nutjob2 wrote 22 hours 16 min ago:
> You want to distribute load across the keyspace so you avoid hot
shards.
This is just another case of keys containing information and is not
smart.
The obvious solution is to have a field that drives distribution,
allowing rebalancing or whatever.
stickfigure wrote 21 hours 44 min ago:
This is something you should discuss with the developers at
Cockroach Labs, Google Cloud, et al.
As a consumer of these databases we're stuck with them as designed,
which means we have to worry about key distribution.
conradludgate wrote 22 hours 6 min ago:
100%. You can use rendezvous hashing to determine the shard(s). The
hash of a sequence should be randomly distributed, as changing the
LSB should propagate to a 50% change in the output bits.
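A minimal rendezvous-hashing sketch in SQL, using Postgres'
(undocumented but long-stable) hashtext() and made-up shard names;
'42' stands in for the key being placed:
    -- each key independently picks the shard maximizing hash(shard, key)
    SELECT s AS chosen_shard
    FROM unnest(ARRAY['shard-a', 'shard-b', 'shard-c']) AS s
    ORDER BY hashtext(s || ':' || '42') DESC
    LIMIT 1;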
dap wrote 22 hours 47 min ago:
It is, although you can have sharded PostgreSQL, in which case I
agree with your assessment that you want random PKs to distribute
them.
It's workload-specific, too. If you want to list ranges of them by
PK, then of course random isn't going to work. But then you've got
competing tensions: listing a range wants the things you list to be
on the same shard, but focusing a workload on one shard undermines
horizontal scale. So you've got to decide what you care about (or do
something more elaborate).
jakeydus wrote 23 hours 21 min ago:
I think they address this in the article when they say that this
advice is specific to monolithic applications, but I may be
misremembering (I skimmed).
andy_ppp wrote 23 hours 16 min ago:
Are you saying a monolith cannot use a distributed database?
jakeydus wrote 23 hours 12 min ago:
I'm not making any claims at all, I was just adding context from
my recollection of the article that appeared to be missing from
the conversation.
Edit: What the article said:
> The kinds of web applications Iâm thinking of with this post
are monolithic web apps, with Postgres as their primary OLTP
database.
So you are correct that this does not disqualify distributed
databases.
cruffle_duffle wrote 1 day ago:
I never understood the arguments against using globally unique
ids. For example, how it somehow messes up indexes. I'm not a CS major
but those are typically b-trees are they not? If you have a primary key
whose generation is truly random such that each number is equally
likely, then that b-tree is always going to be balanced.
Yes there are different flavors of generating them with their own pros
and cons, but at the end of the day it's just so much more elegant
than some auto incrementing crap your database creates. But that is
just semantics; you can always change the uuid algorithm for future
keys. And honestly, if you treat the uuid as some opaque entity (which
you should), why not just pick the random one?
And I just thought of the argument "but what if you want to sort
the uuid..." - say it's used for a list of stories or something?
Well, again... if you treat the uuid as opaque why would you sort it?
You should be sorting on some other field like the date field or title
or something. UUIDs are opaque, damn it. You don't sort opaque data.
"Well they get clustered weird," say people. Why are you clustering
on a random opaque key? If you need certain data to be clustered, then
do it on the right key (a user_id field if your data wants to be
clustered by user, say).
Letting the client generate the primary keys is really liberating. Not
having to care about PK collisions or leaking information via auto
incrementing numbers is great!
In my opinion uuid isn't used enough!
chuckadams wrote 20 hours 39 min ago:
> If you have a primary key whose generation is truly random such
that each number is equally likely, then that b-tree is always going
to be balanced.
Balanced and uniformly scattered. A random index means fetching a
random page for every item. Fine if your access patterns are truly
random, but that's rarely the case.
> Why are you clustering on a random opaque key?
InnoDB clusters by the PK if there is one, and that can't be changed
(if you don't have a PK, you have some options, but let's assume you
have one). MSSQL behaves similarly, but you can override it. If
your PK is random, your clustering will be too. In Postgres, you'll
just get fragmented indexes, which isn't quite as bad, but still
slows down vacuum. Whether that actually becomes a problem is also
going to depend on access patterns.
One shouldn't immediately freak out over having a random PK, but
should definitely at least be aware of the potential degradation they
might cause.
cruffle_duffle wrote 14 hours 23 min ago:
I honestly feel like, while you are indeed correct, for most cases
it's absolutely fine to use some flavor of uuid. I feel like the
benefits outweigh the costs in most cases.
chuckadams wrote 1 hour 10 min ago:
Sure, and for many cases, uuidv7 is that flavor. It just comes
with a timestamp, which may or may not be an issue. It isn't an
issue for me, which is why I use it myself.
nesarkvechnep wrote 1 day ago:
If we embraced REST, as Roy Fielding envisioned it, we wouldn't have
this, and all similar, conversations. REST doesn't expose identifiers,
it only exposes relationships. Identifiers are an implementation
detail.
aynyc wrote 1 day ago:
I've seen this type of advice a few times now. Now I'm not a database
expert by any stretch of the imagination, but I have yet to see UUIDs
as primary keys in any of the systems I've touched.
Are there valid reasons to use UUIDs (assuming they're used correctly)
for primary keys? I know systems have incorrectly exposed primary keys
to the public, but assume that's not the concern. Why use UUID over
big-int?
rcxdude wrote 1 day ago:
Uuids also allow the generation of the ID to be separate from the
insertion into the database, which can be useful in distributed
systems.
cruffle_duffle wrote 1 day ago:
I mean this is the primary reason right here! You can pre-create
an entire tree of relationships client side and ship it off to the
database with everything all nice and linked up. And since by
design each PK is globally unique youâll never need to worry
about constraint violations. Itâs pretty damn nice.
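A sketch of what that looks like (hypothetical tables, ids minted
client-side as v4):
    -- the client already knows every id, so the whole tree ships in
    -- one transaction with the FKs pre-wired
    BEGIN;
    INSERT INTO folders (id, name)
    VALUES ('0b9f43a2-6c1e-4f8a-9d27-55aa01b3c9e4', 'inbox');
    INSERT INTO documents (id, folder_id, title)
    VALUES ('d1c2b3a4-5e6f-47a8-b9c0-d1e2f3a4b5c6',
            '0b9f43a2-6c1e-4f8a-9d27-55aa01b3c9e4', 'draft');
    COMMIT;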
assbuttbuttass wrote 1 day ago:
At least for the Spanner DB, it's good to have a randomly-distributed
primary key since it allows better sharding of the data and avoids
"hot shards" when doing a lot of inserts. UUIDv4 is the typical
solution, although a bit-reversed incrementing integer would work too
HTML [1]: https://cloud.google.com/blog/products/databases/announcing-...
ellisv wrote 1 day ago:
About 10 years ago I remember seeing a number of posts saying "don't
use int for ids!". Typically the reasons were things like "the id
exposes the number of things in the database" and "if you have bad
security then users can increment/decrement the id to get more
data!". What I then observed was a bunch of developers rushing to use
UUIDs for everything.
UUIDv7 looks really promising but I'm not likely to redo all of our
tables to use it.
andatki wrote 15 hours 30 min ago:
Note that if you're using UUID v4 now, switching to v7 does not
require a schema migration. You'd get the benefits when working
with new records, for example reduced insert latency. The uuid data
type supports both.
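Concretely, assuming PostgreSQL 18+ (which ships a native uuidv7()
function) and a hypothetical documents table, the switch is just a new
column default; old v4 values stay valid since the type is uuid either
way:
    ALTER TABLE documents
        ALTER COLUMN id SET DEFAULT uuidv7();  -- was gen_random_uuid()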
dangets wrote 23 hours 52 min ago:
You can use the same techniques except with the smaller int64 space
- see e.g. Snowflake ID -
HTML [1]: https://en.wikipedia.org/wiki/Snowflake_ID
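A sketch of the classic 64-bit layout (41 bits of milliseconds since a
custom epoch, 10 bits of node id, 12 bits of sequence; the node id 5
and sequence 42 are placeholders, and the constant is Twitter's
original 2010 epoch):
    SELECT (((extract(epoch FROM clock_timestamp()) * 1000)::bigint
             - 1288834974657) << 22)  -- ms since custom epoch
           | (5 << 12)                -- node id
           | 42                       -- per-millisecond sequence
           AS snowflake_id;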
reffaelwallen wrote 1 day ago:
At my company we only use UUIDs as PKs.
Main reason I use it is the German tank problem: [1] (tl;dr: it prevents
someone from counting how many records you have in that table)
HTML [1]: https://en.wikipedia.org/wiki/German_tank_problem
infragreen wrote 22 hours 49 min ago:
This was a great read, thank you for sharing!
andatki wrote 15 hours 30 min ago:
Appreciate it!
jakeydus wrote 23 hours 13 min ago:
I'm new to the security side of things; I can understand that
leaking any information about the backend is no bueno, but why
specifically is table size an issue?
boruto wrote 22 hours 43 min ago:
At my old company, new joiners in tech were assigned a monotonically
increasing number as their ID. GitHub profile URLs reflected that.
Someone may or may not have used the pattern to get to know the
attrition rate by running a simple script every month))
littlestymaar wrote 1 day ago:
What stops you from having another uuid field as the publicly visible
identifier (which is only a concern for a minority of your tables)?
This way you avoid most of the issues highlighted in this article,
without compromising your confidential data.
mgoetzke wrote 1 day ago:
That depends a lot on many factors, and thus I don't like generic
statements like that, which tend to be focused on a specific
database pattern. That said, everyone should indeed be aware of the
potential tradeoffs.
And of course we could come up with many ways to generate our own ids
and make them unique, but we have the following requirements.
- It needs to be a string (because we allow composing them to 'derive'
keys)
- A client must be able to create them (not just a server) without risk
for collisions
- The time order of keys must not be guessable easily (as the id is
often leaked via references which could 'betray' not just the existence
of a document, but also its relative creation time wrt others).
- It should be easy to document how any client can safely generate
document ids.
The lookup performance is not really such a big deal for us. Where it
is we can do a projection into a more simple format where applicable.
gwbas1c wrote 1 day ago:
I work on an application where we encrypt the integer primary key and
then use the bytes to generate something that looks like a UUID.
In our case, we don't want database IDs in an API and in URLs. When IDs
are sequential, it enables things like dictionary attacks and provides
estimates about how many customers we have.
Encrypting a database ID makes it very obvious when someone is trying
to scan, because the UUID won't decrypt. We don't even need a database
round trip.
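A rough Postgres-side sketch of the same idea using pgcrypto (not the
actual .NET implementation; the 16-character key is a placeholder):
    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    -- an 8-byte bigint pads out to one 16-byte AES block, i.e.
    -- exactly UUID-sized output
    SELECT encode(encrypt(int8send(12345::bigint),
                          'sixteen-byte-key', 'aes'), 'hex');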
sedatk wrote 20 hours 57 min ago:
That sounds quite troublesome if the encryption key is lost,
compromised, or rotated for any other reason.
cowboylowrez wrote 14 hours 16 min ago:
Key management seems to be as important as backups, but I understand
that something so small (an encryption key) could seem unimportant
because database backups are so big lol. But they really do share
important attributes (do not lose your keys, do not lose your data,
do not expose your keys, do not expose your data, etc etc).
gwbas1c wrote 15 hours 57 min ago:
The key will never be rotated. (There's no reason to.)
We're not worried about key compromises.
If the key is lost, we have much bigger problems.
puilp0502 wrote 1 day ago:
Few questions:
* How do you manage the key for encrypting IDs? Injected to app
environment via envvar? Just embedded in source code? I ask this
because I'm curious as to how much "care" I should be putting in into
managing the secret material if I were to adopt this scheme.
* Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the
plain AES suffice? I assume that the size of IDs would never exceed
the block size of AES, but again, I'm not a cryptographer so not sure
if it's safe to do so.
gwbas1c wrote 15 hours 50 min ago:
> How do you manage the key for encrypting IDs?
The same way we manage all other secrets in the application.
(Summarized below)
> Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the
plain AES suffice? I assume that the size of IDs would never exceed
the block size of AES, but again, I'm not a cryptographer so not
sure if it's safe to do so.
I don't have the source handy at the moment. It's one of the easier
to use symmetric algorithms available in .Net. We aren't talking
military-grade security here. In general: a 32-bit int encrypts to
64-bits, so we pad it with a few unicode characters so it's 64-bits
encrypted to 128 bits.
---
As far as managing secrets in the application: We have a homegrown
configuration file generator that's adapted to our needs. It
generates both the configuration files, and strongly-typed classes
to read the files. All configuration values are loaded at startup,
so we don't have to worry about runtime errors from missing
configuration values.
Secrets (connection strings, encryption keys, etc.) are encrypted
in the configuration file as base64 strings. The certificate to
read/write secrets is stored in Azure Keyvault.
The startup logic in all applications is something like:
1: Determine the environment (production, qa, dev)
2: Get the appropriate certificate
3: Read the configuration files, including decrypting secrets (such
as the primary key encryption keys) from the configuration files
4: Populate the strongly-typed objects that hold the configuration
values
5: These objects are dependency-injected to runtime objects
stevefan1999 wrote 1 day ago:
Try Snowflake ID then:
HTML [1]: https://en.wikipedia.org/wiki/Snowflake_ID
gizzlon wrote 23 hours 37 min ago:
Leaks time and order, just like uuid v7.
0x457 wrote 21 hours 7 min ago:
Kinda the point? Ordered PKs give you a better index.
p2detar wrote 1 day ago:
Another interesting article from Feb-2024 [0] where the cost of
inserting a uuid7() and a bigint is basically the same. To me it wasn't
quite clear what the problem with the buffer cache is, but this author
makes it much clearer than the OP's article:
> We need to read blocks from the disk when they are not in the
PostgreSQL buffer cache. Conveniently, PostgreSQL makes it very easy to
inspect the contents of the buffer cache. This is where the big
difference between uuidv4 and uuidv7 becomes clear. Because of the lack
of data locality in uuidv4 data, the primary key index is consuming a
huge amount of the buffer cache in order to support new data being
inserted - and this cache space is no longer available for other
indexes and tables, and this significantly slows down the entire
workload.
0 -
HTML [1]: https://ardentperf.com/2024/02/03/uuid-benchmark-war
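The easy inspection the quote mentions is the pg_buffercache
extension; e.g., to see which relations are hogging the cache:
    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
    SELECT c.relname, count(*) AS buffers
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 10;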
mkleczek wrote 1 day ago:
That's really an important deficiency of Postgres.
A hash index is ideally suited for UUIDs, but for some reason Postgres
hash indexes cannot be unique.
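For reference, given a table docs with a uuid column id, this is where
Postgres stops you:
    CREATE INDEX ON docs USING hash (id);        -- fine: equality lookups
    CREATE UNIQUE INDEX ON docs USING hash (id);
    -- ERROR:  access method "hash" does not support unique indexes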
mexicocitinluez wrote 1 day ago:
You'll have to rip the ability to generate unique numbers from quite
literally anywhere in my app and save them without conflict from my
cold, dead hands.
The ability to know ahead of time what a primary key will be (in lieu
of persisting it first, then returning) opened up a whole new world of
architecting work in my app. It made a lot of previous awkward things
feel natural.
sgarland wrote 1 day ago:
Sounds like a lot of referential integrity violations.
mexicocitinluez wrote 1 day ago:
Why would generating a PK ahead of time cause referential integrity
violations? Super curious to find out.
sgarland wrote 1 day ago:
The implication is that you need to know the PK ahead of time so
that you can insert it into other tables which reference it as an
FK without waiting for it to be returned, which further implies
that you don't have FK constraints, because the DB would
disallow this.
Tbf in Postgres, you can declare FKs to be deferrable, so their
existence is checked at transaction commit, rather than at
insertion time.
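In SQL that looks like this (table names hypothetical):
    ALTER TABLE documents
        ADD CONSTRAINT documents_folder_id_fkey
        FOREIGN KEY (folder_id) REFERENCES folders (id)
        DEFERRABLE INITIALLY DEFERRED;  -- checked at COMMIT, not per row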
If you don't have the DB enforcing referential integrity, you
need to be extremely careful in your application logic; IME, this
inevitably fails. At some point, someone writes bad code, and you
get data anomalies.
mexicocitinluez wrote 23 hours 22 min ago:
> Tbf in Postgres, you can declare FKs to be deferrable, so
their existence is checked at transaction commit, rather than
at insertion time.
I'm using EF core which hooks up these relationships and allows
me to persist them in a single transaction using MSSQL server.
> If you don't have the DB enforcing referential integrity
I'm building an electronic medical system. I'm well aware of
the benefits of referential integrity.
old8man wrote 1 day ago:
Very useful article, thank you! Many people suggest CUID2, but it is
less efficient and is better used for frontend/url encoding. For
backend/db, only UUID v7 should be used.
kaladin_1 wrote 1 day ago:
I really hoped the author would discuss alternatives for distributed
databases that write in parallel. Sequential keys would be atrocious in
such circumstances; they could kill the whole gain of a distributed
database, as hotspots would inevitably appear.
I would like to hear from others using, for example, Google Spanner: do
you have issues with UUIDs? I don't for now; most optimizations happen
at the Controller level, and data transformation can be slow due to
validations. Try to keep service logic as straightforward as possible.
grugdev42 wrote 1 day ago:
A much simpler solution is to keep your tables as they are (with an
integer primary key), but add a non-sequential public identifier too.
id => 123,
public_id => 202cb962ac59075b964b07152d234b70
There are many ways to generate the public_id. A simple MD5 with a salt
works quite well for extremely low effort.
Add a unique constraint on that column (which also indexes it), and
you'll be safe and performant for hundreds of millions of rows!
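One low-effort way to wire that up (the salt literal is a placeholder;
any scheme mixing a secret into the hash works):
    CREATE TABLE items (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        public_id text NOT NULL UNIQUE
            DEFAULT md5('some-salt' || clock_timestamp()::text
                        || random()::text)
    );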
Why do we developers like to overcomplicate things? ;)
Retr0id wrote 1 day ago:
Per [1], if you use a block cipher rather than a hash then you don't
even need to store it anywhere.
HTML [1]: https://news.ycombinator.com/item?id=46273325
Denvercoder9 wrote 1 day ago:
This misses the point. The reason not to use UUIDv4 is that having an
index on random values is slow(er), because sequential inserts into
the underlying B-tree are faster than random inserts. You're hitting
the same problem with your `public_id` column, that it's not the
primary key doesn't change that.
sgarland wrote 1 day ago:
For InnoDB-based DBs that are not Aurora, and if the secondary
index isnât UNIQUE, it solves the problem, because secondary
non-unique index changes are buffered and written in batches to
amortize the random cost. If you're hashing a guaranteed-unique
entity, I'd argue you can skip the unique constraint on this
index.
For Aurora MySQL, it just makes it worse either way, since
there's no change buffer.
hnfong wrote 1 day ago:
Ints as pk would be quicker for joins etc though.
grugdev42 wrote 4 hours 31 min ago:
Exactly
hk1337 wrote 1 day ago:
My biggest thing with UUIDs is: don't UUID everything. Most things
should be okay with just regular integers as PKs.
BartjeD wrote 1 day ago:
Personally my approach has been to start with big-ints and add a GUID
code field if it becomes necessary. And then provide imports where you
can match objects based on their code, if you ever need to
import/export between tenants, with complex object relationships.
But that also adds complexity.
parpfish wrote 1 day ago:
Two things I don't like about big-int indexes:
- If you use uuids as foreign keys to another table, it's obvious
when you screw up a join condition by specifying the wrong indices.
With int indices you can easily get plausible-looking results because
your join will still return a bunch of data
- if you're debugging and need to search logs, having a simple uuid
string is nice for searching
ivan_gammel wrote 1 day ago:
This article is about a solution in search of a problem, a classic
premature optimization issue. UUIDv4 is perfectly fine for many use
cases, including small databases. The performance argument must be
considered when there's a performance problem on the horizon. Other
considerations may be, and very often are, superior to it.
sgarland wrote 1 day ago:
IME, when performance issues become obvious, the devs are in growth
mode and have no desire / time to revisit PK choice.
Integer PKs were seen as fine for years - decades, even - before the
rise of UUIDs.
sagarm wrote 1 day ago:
It's not really feasible to rekey your UUIDv4-keyed database to
int64s after the fact, imo. Sure, your new tables could be
integer-keyed, but the bulk of your storage will be UUID (and UUIDv4,
if that's what you started with) for a very long time.
mrkeen wrote 17 hours 0 min ago:
If you have too many UUIDs, throw more DBs at the problem.
Hopefully you haven't tied yourself to any architectural decisions
that would limit you to only using a single database.
ivan_gammel wrote 1 day ago:
Yes, sure. My point is, it may never be necessary.
sagarm wrote 23 hours 7 min ago:
I think you're right that it won't matter for most companies. But
having been at a company with persistent DB performance issues
with UUIDv4 keys as a contributing factor, it sucks.
vintermann wrote 1 day ago:
A prime example of premature optimization.
Permanent identifiers should not carry data. This is like the cardinal
sin of data management. You always run into situations where the thing
you thought "surely this never changes, so it's safe to squeeze into
the ID to save a lookup" changes anyway. Then people suddenly find out
they have a new gender identity, and they need a different final digit
in their ID numbers too.
Even if nothing changes, you can run into trouble. Norwegian PNs have
your birth date (in DDMMYY format) as the first six digits. Surely that
doesn't change, right? Well, wrong, since although the date doesn't
change, your knowledge of it might. Immigrants who didn't know their
exact date of birth got assigned 1. Jan by default... And then people
with actual birthdays on 1 Jan got told, "sorry, you can't have that as
birth date, we've run out of numbers in that series!"
Librarians in the analog age can be forgiven for cramming data into
their identifiers, to save a lookup. When the lookup is in a physical
card catalog, that's somewhat understandable (although you bet they
could run into trouble over it too). But when you have a powerful
database at your fingertips, use it! Don't make decisions you will
regret just to shave off a couple of milliseconds!
maxbond wrote 16 hours 33 min ago:
> Permanent identifiers should not carry data.
Do you have the same criticism for serial identifiers? How about
hashes? What about the version field in UUIDs?
danudey wrote 16 hours 36 min ago:
Perhaps you can clarify something for me, because I think I'm missing
it.
> Norwegian PNs have your birth date (in DDMMYY format) as the first
six digits
So presumably the format is DDMMYYXXXXX (for some arbitrary number of
X's), where the X's represent e.g. an automatically incrementing
number of some kind?
Which means that if it's DDMMYYXXX then you can only have 1000 people
born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000
people born on DDMMYY.
So in order for there to be so many such entries in common that
people are denied use of their actual birthday, then one of the
following must be true:
1. The XXX counter must be extremely small, in order for it to run
out as a result of people 'using up' those Jan 1 dates each year
2. The number of people born on Jan 1 or immigrating to Norway
without knowledge of their birthday must be colossal
If it was just DDMMXXXXX (no year) then I can see how this system
would fall apart rapidly, but when you're dealing with specifically
"people born on Jan 1 2014 or who immigrated to Norway and didn't
know their birthday and were born on/around 2014 so that was the year
chosen" I'm not sure how that becomes a sufficiently large number to
cause these issues. Perhaps this only occurs in specific years where
huge numbers of poorly-documented refugees are accepted?
(Happy to be educated, as I must be missing something here)
PunchyHamster wrote 16 hours 43 min ago:
The cause is more just "not having enough bits". UUID is 128 bit.
You're not running out even if you use part for timestamp, the random
part will be big enough.
Like, it's a valid complaint.. just not for discussion at hand.
Also, we do live in reality, and while having an entirely random one
might be perfect from a data-theory standpoint, in reality having it
prefixed by a date has many performance advantages.
> Permanent identifiers should not carry data. This is like the
cardinal sin of data management
As long as you don't use the data and have actual fields for what's
also encoded in UUID, there is absolutely nothing wrong with it,
provided there is enough of the random part to get around artifacts
in real life data.
asah wrote 18 hours 3 min ago:
counterpoint: IRL, data values in a system like PostgreSQL are padded
to word boundaries so either you're wasting bits or "carrying data."
cycomanic wrote 21 hours 7 min ago:
Like the other poster said, this is a problem with default values, not
with encoding the birthday into the personnummer.
I think it also is important to remember the purpose of specific
numbers. For instance I would argue a PN without the birthday would
be strictly worse. With the current system (I only know the Swedish
one, but assume it's the same) I only have to remember a 4-digit number
(because the number is birth date + a unique 4 digits). If we would instead
use completely random numbers I would have to remember at least an 8
digit number (and likely to be future proof you'd want at least 9
digits). Sure that's fine for myself (although I suspect some people
already struggle with it), but then I also have to remember the
numbers for my 2 kids and my partner and things become quickly
annoying. Especially, because one doesn't use the numbers often
enough that it becomes easy, but still often enough that it becomes
annoying to look up, especially when one doesn't always carry their
phone with them.
guhcampos wrote 20 hours 43 min ago:
It's not that bad. Brazilian CPFs are 11 digits and everyone
remembers them. You just get used to it =)
GuB-42 wrote 21 hours 10 min ago:
I don't think the timestamped UUIDs are "carrying data", it is just a
heuristic to improve lookup performance. If the timestamp is wrong,
it will just run as slow as the non-timestamped UUID.
If you take the gender example, for 99% of people, it is male/female
and it won't change, and you can use that for load balancing. But if
later, you found out that the gender is not the one you expect for
that bucket, no big deal, it will cause a branch misprediction, but
instead of happening 50% of the times when you use a random value, it
will only happen 1% of the times, significant speedup with no loss in
functionality.
delecti wrote 18 hours 59 min ago:
As soon as you encode imperfect data in an immutable key, you
always have to check when you retrieve it. If that piece of data
isn't absolutely 100% guaranteed to be perfect, then you have to
query both halves of the load balanced DB anyway.
moralestapia wrote 20 hours 52 min ago:
>and you can use that for load balancing
As long as you're not in China or India around specific years ...
GP's point stands strong.
mzi wrote 21 hours 26 min ago:
> Even if nothing changes, you can run into trouble. Norwegian PNs
have your birth date (in DDMMYY format) as the first six digits.
Surely that doesn't change, right?
I guess that Norway has solved it in the same or similar way as
Sweden? So a person is identified by the PNR and for those systems
that need to track a person over several PNR (government agencies)
use PRI. And a PRI is just the first PNR assigned to a person with a
1 inserted in the middle. If that PRI is occupied, use a 2,and so on.
PRI could of course have been a UUID instead.
ralferoo wrote 22 hours 49 min ago:
> Norwegian PNs have your birth date (in DDMMYY format) as the first
six digits. Surely that doesn't change, right? Well, wrong, since
although the date doesn't change, your knowledge of it might.
Immigrants who didn't know their exact date of birth got assigned 1.
Jan by default... And then people with actual birthdays on 1 Jan got
told, "sorry, you can't have that as birth date, we've run out of
numbers in that series!"
To me, what your example really shows is the problem with incorrect
default values, not a problem with encoding data into a key per se.
If they'd chosen a non-date for unknown values, maybe 00 or 99 for
day or month components, then the issue you described would
disappear.
But in any case, the intention of encoding a timestamp into a UUID
isn't any implied meaning. It's to guarantee uniqueness, with a
side effect that IDs are more or less monotonically increasing.
Whether this is actually desirable depends on your application, but
generally if the application is as an indexed key for insertion into a
database, it's usually more useful for performance than a fully
random ID, as it avoids rewriting lots of leaf nodes of B-trees. If
you insert a load of such keys, they form a cluster on one side
of the tree that can then rebalance with only the top levels needing
to be rewritten.
PunchyHamster wrote 16 hours 47 min ago:
> To me, what your example really shows is the problem with
incorrect default values, not a problem with encoding data into a
key per se. If they'd chosen a non-date for unknown values, maybe
00 or 99 for day or month components, then the issue you described
would disappear.
well, till you run out of numbers for the immigrants that don't
have an exact birth date
croes wrote 20 hours 17 min ago:
> If they'd chosen a non-date for unknown values, maybe 00 or 99
for day or month components, then the issue you described would
disappear
> Norwegian PNs have your birth date (in DDMMYY format) as the
first six digits.
You can already feel the disaster rising because some program
always expects the latter.
And it doesn't fix the problem, it just makes it less likely.
nonethewiser wrote 20 hours 40 min ago:
>To me, what your example really shows is the problem with
incorrect default values, not a problem with encoding data into a
key per se. If they'd chosen a non-date for unknown values, maybe
00 or 99 for day or month components, then the issue you described
would disappear.
You still have that problem from organic birthdays and also the
problem of needing to change ids to correct birth dates.
jonny_eh wrote 18 hours 14 min ago:
To add to that, birthdays can clump, just like any seemingly
"random" data.
Dylan16807 wrote 16 hours 44 min ago:
Not significantly. For actual births, a couple of holidays have
very low rates, but no date clumps into a much higher rate.
lovich wrote 17 hours 29 min ago:
A million dots scattered randomly over a graph can all land on
the exact same coordinate if itâs truly random.
What most people intuit as random is some sort of noise
function that is generally dispersed and doesn't trigger the
pattern matching part of their brain.
Dylan16807 wrote 16 hours 41 min ago:
> A million dots scattered randomly over a graph can all land
on the exact same coordinate if itâs truly random.
It won't happen though. 0.00000000% chance it happens even
once in a trillion attempts.
> What most people intuit as random is some sort of noise
function that is generally dispersed and doesn't trigger
the pattern matching part of their brain
Yes, people intuit the texture of random wrong in a situation
where most buckets are empty. But when you have orders of
magnitude more events than buckets, that effect doesn't
apply. You get pretty even results that people expect.
lovich wrote 15 hours 35 min ago:
> It won't happen though. 0.00000000% chance it happens
even once in a trillion attempts.
It has the same odds as any other specific configuration of
randomly assigned dots. The overly active human pattern
matching behavior is the only reason it would be treated as
special.
FartinMowler wrote 13 hours 2 min ago:
Lol, reminds me of a story: at his workplace my brother
was invited to join a lottery ticket pool where each got
to pick the numbers for a ticket. The numbers he picked
were 1-2-3-4-5-6. Although the others, mostly fellow
engineers, reluctantly agreed his numbers were as likely
as the others, after a couple of weeks they neglected to
invite him again.
coldtea wrote 14 hours 29 min ago:
>It has the same odds as any other specific configuration
of randomly assigned dots
Which doesn't change anything in practice, since it
having "the same odds as any other specific
configuration" ignores the fact that more scattered
configurations are still far more numerous than it (or
than ones with more visual order in general) taken
all together.
>The overly active human pattern matching behavior is the
only reason it would be treated as special.
Nope, it's also the fact that it is ONE configuration,
whereas all the rest are a much, much larger number. That's
enough to make this specific configuration ultra rare in
comparison (since we don't compare it to the others one by
one but to all the others put together).
Dylan16807 wrote 14 hours 46 min ago:
Entropy says it's special. If you have a million dots
and 10,000 coordinates, you have 10,000 ways for all the
dots to land in the same coordinate, and a zillion
kavillion stupillion ways to have somewhere near 100 dots
in each coordinate.
lovich wrote 13 hours 11 min ago:
No, if its randomly distributed then every specific
configuration has the same exact chance of happening.
I am laughing at all the people coming out of the
woodwork to reply to my original post in this thread
misunderstanding randomness and chance.
If you flip a coin a million times and it lands on head
every single time, the millionth and 1 time still has a
50/50 chance of landing on heads
Dylan16807 wrote 12 hours 49 min ago:
> every specific configuration
Who said anything about specific configurations?
We started this talking about whether things "clump"
or not. The result depends on your definition of
"clump" but let's say it involves a standard
deviation. Different standard deviations have wildly
different probabilities, even when every specific
configuration has the same probability.
Nobody responding to you is calculating things wrong.
We're talking about the shape of the data.
Categories. And those categories are different
sizes, because they have different numbers of
specific configurations in them.
> the millionth and 1 time
I don't see any connection between the above
discussion and the gambler's fallacy?
OptionOfT wrote 21 hours 1 min ago:
Belgium's national register number is similar:
YY.MM.DD-AAA.BB
In either the AAA or BB component there is something about the
gender.
But it does mean that there is a limit on the number of people born per
day of a certain gender.
But for a given year, using a moniker will only delay the
inevitable. Sure, there are more numbers, but still limited as
there are SOME parts that need to reflect reality. Year, gender (if
that's still the case?) etc.
hyperman1 wrote 19 hours 38 min ago:
BB is a mod-97 checksum. The first A of AAA encodes your gender
in an even/odd fashion, I forgot if its the first or last A doing
that.
MM or DD can be 00 if unknown. Also MM has +20 or +40 in some
cases.
If you know someone's birth date and gender, the INSZ is almost
certainly 1 in 500 numbers, with a heavy skew to the lower AAA.
Luckily, you can't do much damage with someone's number, unlike a
USA SSN (but I'd still treat it as confidential).
notpushkin wrote 14 hours 7 min ago:
> I'd still treat it confidential
Estonian isikukood is GYYMMDDNNNC, and is relatively public.
You can find mine pretty easily if you know where to look (no
spoilers!). It's relatively harmless.
Kazakh IIN is YYMMDDNNNNNN (where N might have some structure)
and is similarly relatively public: e.g. if you're a sole
proprietor, chances are you have to hang your license on the
wall, which will have it.
It's a bit more serious: I've got my mail at the post
office by just showing a barcode of my IIN to the worker. They
usually scan it from an ID, which I don't have, but I've
figured out the format and created a .pkpass of my own. Zero
questions - here's your package, no we don't need your
passport either, have a nice day!
(Tangential, but Kazakhs also happen to have the most peculiar
post office layout: it looks exactly like a supermarket, where
you go in, find your packages (sorted by the tracking number,
IIRC), and go to checkout. I've never seen it anywhere else.)
tyre wrote 22 hours 37 min ago:
And then have to enter/handle a non-date through all systems? How
do you know if this non-dated person is over the age of majority?
Eligible for a pension?
Maybe the answer is to evenly spread the defaults over 365 days.
ralferoo wrote 22 hours 22 min ago:
If you don't know their birthday, you can presumably never answer
that question in any case.
If you only know the birth year and keyed 99 as the month for
unknown, then your algorithm would determine they were of a
correct age on the start of the year after that was true, which I
guess would be what you want for legal compliance.
If you don't even know if the birth year is correct, then the
correct process depends on policy. Maybe they choose any year,
maybe they choose the oldest/youngest year they might be, maybe
they just encode that as 0000/9999.
Again, if you don't know the birth year of someone, you would
have no way of knowing their age. I'm not sure that means that
the general policy of putting a birthday into their ID number is
flawed.
Many governments re-issue national IDs to the same person with
different numbers, which is far less problematic than the many
governments who choose to issue the same national ID (looking at
you, USA, with your SSN) to multiple individuals. It doesn't seem
like a massive imposition on a person who was originally issued
an ID without their birthday being known to be re-issued a
new ID when their birthday was ascertained. Perhaps even give
them a choice of keeping the old one knowing it will cause
problems, or take the new one instead and having the
responsibility to tell people their number had changed.
Presumably the governments that choose to embed the date into a
national ID number do so because it's more useful for their
purposes to do so than just assigning everyone a random number.
notpushkin wrote 14 hours 36 min ago:
> or take the new one instead and having the responsibility to
tell people their number had changed
Or have the opportunity to scam people into thinking you're a
different person. (E.g. take a $1M loan, go bankrupt, remember
your birthday, and take a loan again.)
benterix wrote 23 hours 42 min ago:
Your comment is valid but is not related to the article.
spoiler wrote 23 hours 13 min ago:
More broadly, this is the ages old surrogate vs natural key
discussion, but yes the comment completely misses the point of the
article. I can only assume they didn't read it in full!
vintermann wrote 23 hours 4 min ago:
The article explicitly argues against the use of GUIDs as primary
keys, and I'm arguing for it.
A running number also carries data. Before you know it, someone's
relying on the ordering or counting on there not being gaps - or
counting the gaps to figure out something they shouldn't.
michaelt wrote 18 hours 54 min ago:
> A running number also carries data. Before you know it,
someone's relying on the ordering or counting on there not
being gaps - or counting the gaps to figure out something they
shouldn't.
For example, if [1] can be seen but [2] can't, someone might
infer the existence of a hidden issue relating to a critical
security problem.
Whereas if the URL was instead [3] that risk would be avoided.
HTML [1]: https://github.com/pytorch/pytorch/issues/111111
HTML [2]: https://github.com/pytorch/pytorch/issues/111110
HTML [3]: https://github.com/pytorch/pytorch/issues/761500e0-007...
konart wrote 21 hours 37 min ago:
> Before you know it, someone's relying
Do not expose your internal IDs. As simple as that.
dpark wrote 21 hours 24 min ago:
This came up in the last two threads I read about uuidv7.
This is simply not a meaningful statement. Any ID you expose
externally is also an internal ID. Any ID you do not expose
is internal-only.
If you expose data in a repeatable way, you still have to
choose what IDs to expose, whether that's the primary key
or a secondary key. (In some cases you can avoid exposing
keys at all, but those are narrow cases.)
konart wrote 20 hours 55 min ago:
You have one ID as a primary key. It is used for building
relations in your database.
The second ID has nothing to do with internal structure of
your data. It is just another field.
You can change your structure however you want (or type of
your "internal" IDs) and you don't have to worry about an
external consumer. They still get their artificial ID.
dpark wrote 20 hours 43 min ago:
So what you meant is not to expose the primary key?
That's a more reasonable statement but I still don't
agree. This feels like one of those "best practices"
that people apply without thinking and create pointless
complexity.
Don't expose your primary key if there is a reason to
separate your primary key from the externally-exposed
key. If your primary key is the form that you want to
expose, then you should just expose the primary key. e.g.
If your primary key is a UUID, and you create a separate
UUID just to expose publicly, you have most likely added
useless complexity to your system.
whynotminot wrote 16 hours 28 min ago:
> create pointless complexity
My exact thought.
A lot else has failed in your system, from access
control to API design, if this becomes a problem.
Security by obscurity isn't the answer.
If the only thing between an attacker and your DB is
that they can't guess the IDs, you're already in
some serious trouble.
benterix wrote 22 hours 36 min ago:
> The article explicitly argues against the use of GUIDs as
primary keys, and I'm arguing for it.
Let's clarify things.
The author argues against UUIDv4 as primary keys when compared
to integers or bigints in large databases for performance
reasons.
The examples you give refer to the common mistake of using a
non-unique attribute that can be changed for a given entity as
a primary key.
scottlamb wrote 23 hours 46 min ago:
> Permanent identifiers should not carry data.
I think you're attacking a straw man. The article doesn't say
"instead of UUIDv4 primary keys, use keys such as birthdays with
exposed semantic meaning". On the contrary, they have a section about
how to use sequence numbers internally but obfuscated keys
externally. (Although I agree with dfox's and formerly_proven's
comments [1, 2] that the XOR method they proposed for this is terrible.
Reuse of a one-time pad is probably the most basic textbook example
of bad cryptography. They referred to the values as "obfuscated" so
they probably know this. They should have just gone with a better
method instead.) [1]
HTML [1]: https://news.ycombinator.com/item?id=46272985
HTML [2]: https://news.ycombinator.com/item?id=46273325
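For what a "better method" might look like: the Postgres wiki's
pseudo_encrypt is a small Feistel network over int4 - a
reversible, non-linear permutation rather than a reused pad. A
sketch, lightly adapted from the wiki (the constants are just its
arbitrary round keys, nothing sacred):
  -- Maps each int to a unique scrambled int; invertible by
  -- construction, so it can sit between the DB and the API.
  CREATE FUNCTION pseudo_encrypt(value int) RETURNS int AS $$
  DECLARE
    l1 int; l2 int; r1 int; r2 int; i int := 0;
  BEGIN
    l1 := (value >> 16) & 65535;
    r1 := value & 65535;
    WHILE i < 3 LOOP
      l2 := r1;
      r2 := l1 # ((((1366 * r1 + 150889) % 714025)
                   / 714025.0) * 32767)::int;
      l1 := l2; r1 := r2; i := i + 1;
    END LOOP;
    RETURN ((r1 << 16) + l1);
  END;
  $$ LANGUAGE plpgsql IMMUTABLE STRICT;
Still obfuscation rather than encryption, but unlike XOR with a
constant, two outputs no longer reveal the XOR-difference of
their inputs.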
patmorgan23 wrote 22 hours 12 min ago:
Insert order or time is information. And if you depend on that
information you are going to be really disappointed when back dated
records have to be inserted.
scottlamb wrote 21 hours 54 min ago:
Right, to ensure your clients don't depend on that information,
make the key opaque outside the database through methods such as
the ones dfox and formerly_proven suggested, as I said.
naasking wrote 23 hours 29 min ago:
I don't think the objection is that it exposes semantic meaning,
but that any meaningful information is contained within the key at
all, eg. even a UUID that includes timestamp information about when
it was generated is "bad" in a sense, as it leaks information.
Unique identifiers should be opaque and inherently meaningless.
scottlamb wrote 22 hours 49 min ago:
Your understanding is inconsistent with the examples in
vintermann's comment. Using a sequence number as an internal-only
surrogate key (deliberately opaqued when sent outside the bounds
of the database) is not the same as sticking gender identity,
birth date, or any natural properties of a book into a broadly
shared identifier.
naasking wrote 20 hours 43 min ago:
No it's not, they very explicitly clarify in follow-up comments
that unique identifiers should not have any kind of meaningful
content embedded. See: [1]
HTML [1]: https://news.ycombinator.com/item?id=46276995
HTML [2]: https://news.ycombinator.com/item?id=46273798
scottlamb wrote 20 hours 41 min ago:
Okay, but they ignore the stuff I was talking about,
consistent with my description of this as a straw man attack.
> A running number also carries data. Before you know it,
someone's relying on the ordering or counting on there not
being gaps - or counting the gaps to figure out something
they shouldn't.
The opaquing prevents that.
They also describe this as a "premature optimization". That's
half-right: it's an optimization. Having the data to support
an optimization, and focusing on optimizing things that are
hard to migrate later, is not premature.
sgarland wrote 1 day ago:
> Permanent identifiers should not carry data.
Did you read the article? He doesn't recommend natural keys, he
recommends integer-based surrogates.
> A prime example of premature optimization.
Disagree. Data is sticky, and PKs especially so. Moreover, if
you're going to spend time optimizing anything early on, it should
be your data model.
> Don't make decisions you will regret just to shave off a couple of
milliseconds!
A bad PK in some databases (InnoDB engine, SQL Server if clustered)
can cause query times to go from sub-msec to tens of msec quite
easily, especially with cloud solutions where storage isn't
node-local. I don't just mean a UUID; a BIGINT PK on a 1:M can
destroy your latency for the simple reason of needing to fetch a
separate page for every record. If instead the PK is a composite of
(parent_id, id) - e.g. (user_id, id) - where id is a monotonic
integer, you'll have WAY better data locality.
Postgres suffers a different but similar problem with its visibility
map lookups.
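A minimal Postgres sketch of that layout (table and column names
invented here, not taken from the article):
  -- Rows for one user sort together in the PK index, so an
  -- "all events for user X" scan touches far fewer pages.
  CREATE TABLE user_events (
      user_id bigint NOT NULL,
      id      bigint GENERATED ALWAYS AS IDENTITY,
      payload text,
      PRIMARY KEY (user_id, id)
  );
Caveat: a plain Postgres heap is not clustered by the PK, so the
locality win there is mostly in the index itself; in InnoDB or a
clustered SQL Server table the row data is physically ordered
this way too.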
liuliu wrote 20 hours 10 min ago:
I read it (and regret it; it was a waste of my time). Their
arguments are:
* integer keys are faster;
* uuidv7 keys are faster than v4;
* if you want obfuscated keys, use integers and do your own
obfuscation (!!!).
I can get on board with uuidv7 (with the trade-off, of course, on
stronger guessability). The integer keys argument is strange. At
that point, you need to come up with a custom-built system to
avoid id collisions in a distributed system, and it achieves only
a 2x saving (the absolute minimum you should do is 64-bit keys).
Very puzzling suggestion and to me very wrong.
Note that in this entire article, the recommendation is not about
using natural keys (email address, some composite of user
identification etc.), so I am skipping that whole discussion.
sgarland wrote 15 hours 55 min ago:
You can hand out chunks of sequential ids from a central
coordinator to avoid collision; this is a well-established
pattern.
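In Postgres that coordinator can be as simple as a sequence that
steps by a whole block - a sketch, names invented here; each
nextval() hands the caller 1000 ids to assign locally:
  CREATE SEQUENCE id_block_seq INCREMENT BY 1000;
  -- The caller owns [block_start, block_start + 999].
  SELECT nextval('id_block_seq') AS block_start;
This is essentially the hi/lo allocation pattern ORMs have used
for years.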
Re: natural keys (or something like it), I was using it as an
example of how badly PK choice can impact performance at scale.
PunchyHamster wrote 16 hours 33 min ago:
> (with the trade-off, of course, on stronger guessability).
you're not guessing a random number with 2^72 possibilities. And
if guessing a UUID does something in your app, you already fucked up
lukeschlather wrote 23 hours 19 min ago:
> Did you read the article? He doesn't recommend natural keys, he
recommends integer-based surrogates.
I am not a cryptographer, but I would want his recommendation
reviewed by a cryptographer. And then I would have to implement it.
UUIDs have been extensively reviewed by cryptographers, I have a
variety of excellent implementations I can use, I know they solve
the problem well. I know they can cause performance issues; they're
a security feature that is easy to implement, and I can deal with
the performance issues if and when they crop up. (Which, in my
experience, is unusual. Even at a large company, most databases I
encounter do not have enough data. I will err on the side of
security until it becomes a problem, which is a good problem to
have.)
alerighi wrote 22 hours 53 min ago:
Why would they be a security feature? They are not; the article
even says it. Even if UUIDv4s are random, nobody guarantees that
they are generated with a cryptographically secure random number
generator, and in fact most implementations aren't!
The reason you use a UUID in a lot of contexts is when you have a
distributed system where you want your client to decide the ID
that is then stored in multiple systems that don't communicate.
This is surely a valid scenario for a random UUID.
To me the rule is: use a UUID as a customer-facing ID for things
that have to have an identity (e.g. a user, an order, etc.) and
expose it publicly through APIs; use integer IDs as internal
identifiers that are used to create relations between entities,
and keep internal IDs private. That way the more efficient
numeric IDs remain inside the database and are used for joining
data, while the UUID is used only for accessing the object from
an API (for example); internally, when joining (where you have to
deal with a lot of rows), you can use the more efficient numeric
ID.
By the way, I think the practice of "using UUIDs" came from NoSQL
databases, where you surely use a UUID, but you also don't have
to join data. People then transposed a best practice from one
scenario to SQL, where it's not really best practice...
lukeschlather wrote 22 hours 17 min ago:
If a sequential ID is exposed to the client, the client can
trivially use it to determine the number of records and the
relative age of any records. UUID solves this, and the use of a
cryptographically secure number generator isn't really
necessary for it to solve this. The author's scheme might be
similarly effective, but I trust UUIDs to work well. There are
obviously varying ways to hide this information other than
UUIDs, but UUIDs are simple and I don't have to think about it,
I just get the security benefits. I don't have to worry about
not exposing IDs to the clients, I can do it freely.
sgarland wrote 19 hours 42 min ago:
I have never seen anyone post an actual example of the German
Tank problem creating an issue for them, only that it's
possible.
> I don't have to think about it
And here we have the main problem of most DB issues I deal
with on a daily basis - someone didn't want to think about
the implications of what they were doing, and it suddenly
becomes my emergency because they have no idea how to address
it.
mrkeen wrote 10 hours 30 min ago:
Here you go: [1] [2025-12-10]
> We generate every valid 7-digit North American phone
number, then for every area code, send every number in
batches of 40000
> Time to go do something else for a while. Just over 27
hours and one ill-fated attempt at early season ski touring
later, the script has finished happily, the logfile is full
of entries, and no request has failed or taken longer than
3 seconds. So much for rate limiting. We've leaked every
Freedom Chat user's phone number
HTML [1]: https://news.ycombinator.com/item?id=46279123
lukeschlather wrote 16 hours 27 min ago:
If you can predict user IDs this is extremely useful when
you're trying to come up with an exploit that might create
a privileged user, or perhaps you can create some object
you have access to that is owned by users that will be
created in the near future.
When I say "I don't have to think about it" I mean I don't
have to think about the ways an attacker might be able to
predict information about my user ids which they could use
to gain access to accounts, because I know they cannot
predict information about user ids.
You are dismissing the implications of using something that
is less secure than UUIDs and you haven't convinced me I'm
the one failing to think through the implications. I know
there are performance problems, I know they might require
some creative solutions. I am not worried about
unpredictable performance issues, I am worried about
unpredictable security problems.
sgarland wrote 15 hours 58 min ago:
Perhaps this is my bias coming through. I work with DBs
day in and day out, and the main problem I face is
performance from poorly-designed schemas and queries;
next largest issue is referential integrity violations
causing undefined behavior. The security issues I've
found were all people doing absurdly basic stuff, like
exposing an endpoint that dumped passwords.
To me, if you're relying on having a matching PK as
security, something has already gone wrong. There are
ways to provide AuthN and AuthZ other than that. And yes,
"defense in depth," but if your base layer is "we
have unguessable user ids," IME people will become
complacent, and break it somewhere else in the stack.
barrkel wrote 1 day ago:
Uuid v7 just has a bias in its generation; it isn't carrying
information. You're not going to try and extract a timestamp from a
uuid.
Random vs time biased uuids are not a decision to shave off ms that
you will regret.
Most likely they will be a decision that shaves off seconds (yes,
really - especially when you consider locality effects) and you'll
regret nothing.
ncruces wrote 18 hours 8 min ago:
> You're not going to try and extract a timestamp from a uuid.
So, a random library: [1]
> Time returns the time in 100s of nanoseconds since 15 Oct 1582
encoded in uuid. The time is only defined for version 1, 2, 6 and
7 UUIDs.
HTML [1]: https://pkg.go.dev/github.com/google/uuid#UUID.Time
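Postgres is heading the same way: if I'm reading the release
notes right, Postgres 18 ships uuid_extract_timestamp(), which
makes the extraction a one-liner (table/column names invented):
  -- NULL unless the uuid version embeds a timestamp.
  SELECT uuid_extract_timestamp(id) FROM events LIMIT 1;
Once that's in the box, Hyrum's Law says someone will build on it.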
duckerude wrote 22 hours 39 min ago:
I've worked on a system where ULIDs (not UUIDv7, but similar)
were used with a cursor to fetch data in chronological order and
then - surprise! - one day records had to be backdated, meaning
that either the IDs for those records had to be counterfeited
(potentially violating invariants elsewhere) or the fetching had
to be made smarter.
You can choose to never make use of that property. But it's
tempting.
voidnap wrote 20 hours 52 min ago:
I made a service using something like a 64-bit wide ULID, but
there was never a presumption that data would be inserted or
updated earlier than the most recent record.
If the domain is modeling something like external events (in my
case), and that external timestamp is packed into your primary
key, and you support receiving events out of chronological order,
then it just follows that you might insert stuff earlier than
your latest record.
You're gonna have problems "backdating" if you mix up time of
insertion with when the event you model actually occurred. Like
if you treat those as the same thing when they aren't.
jandrewrogers wrote 23 hours 46 min ago:
> You're not going to try and extract a timestamp from a uuid.
Hyrum's Law suggests that someone will.
tobyhinloopen wrote 1 day ago:
> You're not going to try and extract a timestamp from a uuid.
I totally used uuidv7s as "inserted at" in a small project and I
had methods to find records created between two timestamps that
literally converted timestamps to uuidv7 values so I could do
"WHERE id BETWEEN a AND b"
bri3d wrote 1 day ago:
> You're not going to try and extract a timestamp from a uuid.
What? The first 48 bits of a UUIDv7 are a Unix timestamp in milliseconds.
Whether or not this is a meaningful problem or a benefit to any
particular use of UUIDs requires thinking about it; in some cases
it's not to be taken lightly and in others it doesn't matter at
all.
I see what you're getting at, that ignoring the timestamp aspect
makes them "just better UUIDs," but this ignores security
implications and the temptation to partition by high bits
(timestamp).
nine_k wrote 23 hours 25 min ago:
Nobody forces you to use a real Unix timestamp. BTW the original
Unix timestamp is 32 bits (expiring in 2038), and now everyone is
switching to 64-bit time_t. What 48 bits?
All you need is a guaranteed non-decreasing 48-bit number. A
clock is one way to generate it, but I don't see why a UUIDv7
would become invalid if your clock is biased, runs too fast, too
slow, or whatever. I would not count on the first 48 bits being a
"real" timestamp.
bri3d wrote 22 hours 47 min ago:
> Nobody forces you to use a real Unix timestamp.
Besides the UUIDv7 specification, that is? Otherwise you have
some arbitrary kind of UUID.
> I would not count on the first 48 bits being a "real"
timestamp.
I agree; this is the existential hazard under discussion which
comes from encoding something that might or might not be data
into an opaque identifier.
I personally don't agree as dogmatically with the grandparent
post that extraneous data should _not_ be incorporated into
primary key identifiers, but I also disagree that "just use
UUIDv7 and treat UUIDs as opaque" is a completely plausible
solution either.
sroussey wrote 17 hours 57 min ago:
That is like the HTML specification -- nobody ever puts up a
web page that is not conformant. ;p
The idea behind putting some time as prefix was for btree
efficiency, but lots of people use client side generation and
you can't trust it, and it should not matter because it is
just an id, not a way of registering time.
nine_k wrote 20 hours 45 min ago:
I mean, any 32-bit unsigned integer is a valid Unix timestamp
up until 19 January 2038, and, by extension, any u64 is, too,
for a far longer time.
The only promise of Unix timestamps is that they never go
back, always increase. This is a property of a sequence of
UUIDs, not any particular instance. At most, one might argue
that an "utterly valid" UUIDv7 should not contain a timestamp
from the far future. But I don't see why it can't be any time in
the past, as long as the timestamp part does not decrease.
The timestamp aspect may be a part of an additional interface
agreement: e.g. "we guarantee that this value is UUIDv7 with
the timestamp in UTC, no more than a second off". But I
assume that most sane engineers won't offer such a guarantee.
The useful guarantee is the non-decreasing nature of the
prefix, which allows for sorting.
oblio wrote 1 day ago:
> Well, wrong, since although the date doesn't change.
Someone should have told Julius Caesar and Gregory XIII that :-p
hnfong wrote 1 day ago:
The curious thing about the article is that, it's definitely
premature optimization for smaller databases, but when the database
gets to the scale where these optimizations start to matter, you
actually don't want to do what they suggest.
Specifically, if your database is small, the performance impact is
probably not very noticeable. And if your database is large (eg. to
the extent primary keys can't fit within 32-bit int), then you're
actually going to have to think about sharding and making the system
more distributed... and that's where UUID works better than
auto-incrementing ints.
scottlamb wrote 19 hours 3 min ago:
I agree there's a scale below which this (or any) optimization
doesn't matter, and a scale above which you want your primary key to have
locality (in terms of which shard/tablet/... is responsible for the
record). But...
* I think there is a wide range in the middle where your database
can fit on one machine if you do it well, but it's worth optimizing
to use a cheaper machine and/or extend the time until you need to
switch to a distributed db. You might hit this middle range soon
enough (and/or it might be a painful enough transition) that it's
worth thinking about it ahead of time.
* If/when you do switch to a distributed database, you don't always
need to rekey everything:
** You can spread existing keys across shards via hashing on lookup
or reversing bits. Some databases (e.g. DynamoDB) actually force
this.
** Allocating new ids in the old way could be a big problem, but
there are ways out. You might be able to switch allocation schemes
entirely without clients noticing if your external keys are
sufficiently opaque. If you went with UUIDv7 (which addresses some
but not all of the article's points), you can just keep using it.
If you want to keep using dense(-ish), (mostly-)sequential bigints,
you can amortize the latency by reserving blocks at a time.
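For the hashing-on-lookup option, a sketch using one of
Postgres's built-in hash functions (the shard count and table
name are placeholders):
  -- Spread existing sequential bigint keys over 16 shards
  -- without rekeying anything.
  SELECT abs(hashint8(id)::bigint) % 16 AS shard
  FROM accounts WHERE id = 12345;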
hyperpape wrote 1 day ago:
Your comment is sufficiently generic that it's impossible to tell
what specific part of the article you're agreeing with, disagreeing
with, or expanding upon.
vintermann wrote 1 day ago:
I disagree that performance should be a reason to choose running
numbers over guids until you absolutely have to.
I think IDs should not carry information. Yes, that also means I
think UUIDv7 was wrong to squeeze a creation date into their ID.
Isn't that clear enough?
hxtk wrote 21 hours 22 min ago:
When I think "premature optimization," I think of things like
making a tradeoff in favor of performance without justification.
It could be a sacrifice of readability by writing uglier but more
optimized code that's difficult to understand, or spending time
researching the optimal write pattern for a database that I could
spend developing other things.
I don't think I should ignore what I already know and
intentionally pessimize the first draft in the name of avoiding
premature optimization.
hyperpape wrote 1 day ago:
Those are two unrelated points and the connection between them
was unclear in the original post.
anamexis wrote 1 day ago:
For what it's worth, it was also completely unclear to me how
you were responding to the article itself. It does not discuss
natural keys at all.
barrkel wrote 1 day ago:
UUID v7 doesn't squeeze creation date in. If you treat it as
anything other than a random sequence in your applications,
you're just wrong.
zamadatix wrote 1 day ago:
"What it does" and "what I think you should do with it" should
not be treated as equivalent statements.
mcny wrote 1 day ago:
That's the creation date of that guid though. It doesn't say
anything about the entity in question. For example, you might be
born in 1987 and yet only get a social security number in 2007
for whatever reason.
So, the fact that there is a date in the uuidv7 does not extend
any meaning or significance to the record outside of the
database.
To infer such a relationship where none exists is the error.
kube-system wrote 1 day ago:
The time component either has meaning and it should be in its
own column, or it doesn't have meaning and it is unnecessary
and shouldn't be there at all.
I'm not a normalization fanatic, but we're only talking about
1NF here.
vintermann wrote 1 day ago:
You can argue that, but then what is its purpose? Why should
anyone care about the creation date of a by-design completely
arbitrary thing?
I bet people will extract that date and use it, and it's hard
to imagine use which wouldn't be abuse. To take the example of
a PN/SSN and the usual gender bit: do you really want anyone to
be able to tell that you got a new ID at that time? What could
you suspect if a person born in 1987 got a new PN/SSN around
2022?
Leaks like that, bypassing whatever access control you have in
your database, are just one reason to use real random IDs. But
it's even a pretty good one in itself.
kentm wrote 9 hours 30 min ago:
> You can argue that, but then what is its purpose?
The purpose is to reduce randomness while still preserving the
probability of uniqueness. UUIDv4s come with performance
issues when used to bucket data for updates, such as when
they're used as primary keys in a database.
A database like MySQL or PostgreSQL has sequential ids and
you'd use those instead, but if you're writing something
like Iceberg tables using Trino/Spark/etc. then being able to
generate unique ids (without using a data store) that tend to
be clustered together is useful.
majorchord wrote 1 day ago:
> You can argue that, but then what is its purpose? Why
should anyone care about the creation date of a by-design
completely arbitrary thing?
Pretty sure sorting and filtering them by date/time range in
a database is the purpose.
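Concretely, a time-range filter can ride the PK index by packing
the target instant into the leading 48 bits and zeroing the rest.
A rough Postgres sketch, assuming the ids really are v7, with
function/table/column names made up:
  -- Lowest possible UUIDv7 for a given instant.
  CREATE FUNCTION uuid7_floor(ts timestamptz) RETURNS uuid AS $$
    SELECT (lpad(to_hex((extract(epoch FROM ts) * 1000)::bigint),
                 12, '0') || '70008000000000000000')::uuid;
  $$ LANGUAGE sql IMMUTABLE;
  SELECT * FROM events
  WHERE id BETWEEN uuid7_floor('2025-12-01')
               AND uuid7_floor('2025-12-02');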
dpark wrote 21 hours 17 min ago:
That is absolutely not the purpose. The specific purpose of
uuidv7 is to optimize for B-Tree characteristics, not so
you can craft queries based on the IDs being sequential.
This assumption that you can query across IDs is exactly
what is being cautioned against. As soon as you do that,
you are taking a dependency on an implementation detail.
The contract is that you get a UUID, not that you get 48
bits of timestamp. There are 8 different UUID types and
even v7 has more than one variant.
kentm wrote 9 hours 28 min ago:
B-trees too but also bucketing for formats like delta
lake or iceberg, where having ids that cluster will
reduce the number of files you need to update.
miroljub wrote 1 day ago:
If you need sorting and filtering by date, just add a
timestamp to your table instead of misusing an Id column
for that.
majorchord wrote 23 hours 21 min ago:
> just
It is easy to have strong opinions about things you are
sheltered from the consequences of.
mixmastamyk wrote 23 hours 23 min ago:
That happens, in general. The benefit comes when it's
time to look up by uuid only; the prefix is an index to
its disk block location.
dpark wrote 21 hours 15 min ago:
> the prefix is an index to its disk block location
What? This is definitely not the case and can't be
because B-tree nodes change while UUIDs do not.
mixmastamyk wrote 21 hours 14 min ago:
I didn't mean that literally, but it's no longer
editable. It was supposed to have "like" etc. in
there.
dpark wrote 20 hours 59 min ago:
But UUIDv7 doesn't change that at all. It
doesn't matter what flavor of UUID you choose.
The ID is always "like" an index to a block in
that you traverse the tree to find the node. What
UUIDv7 does is improve some performance
characteristics when creating new entries and
potentially for caching.
naasking wrote 23 hours 33 min ago:
Exactly, be explicit, don't shoehorn multiple purposes
into a single column that's supposed to be a largely
meaningless unique identifier.
anamexis wrote 1 day ago:
I would argue that is one of very few situations where
leaking the timestamp at which the ID was created, when you
already have the ID, is a possible concern at all.
And when working with very large datasets, there are very
significant downsides to large, completely random IDs (which
is of course what the OP is about).
mcny wrote 1 day ago:
> What could you suspect if a person born in 1987 got a new
PN/SSN around 2022?
Thank you for spelling it out for me.
For the readers: it leaks information that the person is likely
not a natural-born citizen. The assumption doesn't have to be a
hundred percent accurate; there is a way to make that assumption
and possibly hold it against you. And there are probably a
million ways that a record-creation date could be held against
you. If they don't put it in writing, how will you prove they
discriminated against you?
Thinking... I don't have a good answer to this. If data
exists, people will extract meaning from it, whether rightly
or not.
infogulch wrote 1 day ago:
To quote the great Mr Sparrow:
> The only rules that really matter are these: what a man
can do and what a man can't do.
When evaluating security matters, it's better to strip off
the moral valence entirely ("rightly") and only consider
what is possible given the data available.
Another potential concerning implication besides
citizenship status: a person changed their id when put in a
witness protection program.
mkleczek wrote 1 day ago:
This is actually a very deep and interesting topic.
Stripping information from an identifier disconnects a piece of data
from the real world which means we no longer can match them. But such
connection is the sole purpose of keeping the data in the first
place. So, what happens next is that the real world tries to adjust
and the "data-less" identifier becomes a real world artifact. The
situation becomes the same but worse (eg. you don't exist if you
don't remember your social security id). In extreme cases people are
tattooed with their numbers.
The solution is not to come up with yet another artificial identifier
but to come up with better means of identification taking into
account the fact that things change.
PunchyHamster wrote 16 hours 36 min ago:
An identifier is just a piece of common token a system can use
to operate on the same entity.
You need it, because it's maybe the one lone unchangeable thing.
Taking a person for example:
* date of birth can be changed, if there was an error and a
correction in documents
* any and nearly all existing physical characteristics can change
over time, either due to brain things (deciding to change gender),
aging, or accidents (fingerprints no longer apply if you burnt your
skin enough)
* DNA might be good enough, but that's one fucking long identifier
to share and one hard to validate in the field.
So a unique ID attached to a few other attributes to identify the
current iteration of an individual is the best we have, and the
best we will get.
brettgriffin wrote 21 hours 23 min ago:
> Stripping information from an identifier disconnects a piece of
data from the real world which means we no longer can match them.
But such connection is the sole purpose of keeping the data in the
first place.
The surrogate key's purpose isn't to directly store the natural
key's information, rather, it's to provide an index to it.
> The solution is not to come up with yet another artificial
identifier but to come up with better means of identification
taking into account the fact that things change.
There isn't 'another' - there's just one. The surrogate key. The
other pieces of information you're describing are not the means of
indexing the data. They are the pieces of data you wish to
retrieve.
mkleczek wrote 20 hours 36 min ago:
Any piece of information that can be used to retrieve something
using this index has to be available "outside" your database -
ie. to issue a query "give me piece of information identified by
X" you have to know X first. If X is only available in your index
then you must have another index to retrieve X based on some
externally available piece of information Y. And then X becomes
useless as an identifier - it just adds a level of indirection
that does not solve any information retrieval problem.
That's my whole point: either X becomes a "real world artifact"
or it is useless as identifier.
brettgriffin wrote 1 hour 24 min ago:
That's not really how data is requested. Most of these
identifiers are foreign keys - they exist in a larger object
graph. Most systems of record are too large for people to
associate surrogate keys with anything meaningful - they can
easily have hundreds of billions of records.
Rather, users traverse that object graph, narrowing a range of
keys of interest.
This Hacker News article was given a surrogate key, 46272487.
From that, you can determine what it links to, the
name/date/author of the submission, comments, etc.
46272487 means absolutely nothing to anybody involved. But if
you wanted to see submissions from user pil0u, or submissions
on 2025-12-15, or submissions pertaining to UUID, 46272487
would be in that result set. Once 46272487 joins out to all of
its other tables, you can populate a list that includes the
user name, title, domain, etc.
Do not encode identifying information in unique identifiers!
The entire world of software is built on surrogate keys and
they work wonderfully.
everforward wrote 1 day ago:
> The solution is not to come up with yet another artificial
identifier but to come up with better means of identification
taking into account the fact that things change.
I think artificial and data-less identifiers are the better means
of identification that takes into account that things change. They
don't have to be the identifier you present to the world, but
having them is very useful.
E.g. phone numbers are semi-common identifiers now, but phone
numbers change owners for reasons outside of your control. If you
use them as an internal identifier, changing them between accounts
gets very messy because now you don't have an identifier for the
person who used to have that phone number.
It's much cleaner and easier to adapt if each person gets an
internal context-less identifier and you use their phone number to
convert from their external ID/phone number to an internal ID. The
old account still has an identifier, there's just no external
identifier that translates to it. Likewise if you have to change
your identifier scheme, you can have multiple external IDs that
translate to the same internal ID (i.e. you can resolve both their
old ID and their new ID to the same internal ID without insanity in
the schema).
mkleczek wrote 22 hours 38 min ago:
> I think artificial and data-less identifiers are the better
means of identification that takes into account that things
change. They don't have to be the identifier you present to the
world, but having them is very useful.
If the only reason you need a surrogate key is to introduce
indirection in your internal database design then sequence
numbers are enough. There is no need to use UUIDs.
The whole discussion is about externally visible identifiers (ie.
identifiers visible to external software, potentially used as a
persistent long-term reference to your data).
> E.g. phone numbers are semi-common identifiers now, but phone
numbers change owners for reasons outside of your control. If you
use them as an internal identifier, changing them between
accounts gets very messy because now you don't have an identifier
for the person who used to have that phone number.
Introducing surrogate keys (regardless of whether UUIDs or
anything else) does not solve any problem in reality. When I come
to you and say "My name is X, this is my phone number, this is my
e-mail, I want my GDPR records deleted", you still need to be
able to find all data that is related to me. Surrogate keys don't
help here at all. You either have to be able to solve this issue
in the database or you need to have an oracle (ie. a person) that
must decide ad-hoc what piece of data is identified by the
information I provided.
The key issue here is that you try to model identifiable
"entities" in your data model, while it is much better to model
"captured information".
So in your example there is no "person" identified by "phone
number" but rather "at timestamp X we captured information about
a person at the time named Y and using phone number Z".
Once you start thinking about your database as structured storage
of facts that you can use to infer conclusions, there is much
less need for surrogate keys.
dpark wrote 19 hours 6 min ago:
> So in your example there is no "person" identified by "phone
number" but rather "at timestamp X we captured information
about a person at the time named Y and using phone number Z".
Once you start thinking about your database as structured
storage of facts that you can use to infer conclusions, there
is much less need for surrogate keys.
This is so needlessly complex that you contradicted yourself
immediately. You claim there is no "person" identified but
immediately say you have information "about a person". The
fact that you can assert that the information is about a person
means that you have identified a person.
Clearly tying data to the person makes things so much easier. I
feel like attempting to do what you propose is begging to mess
up GDPR erasure.
> "So I got a request from a John Doe to erase all data we
recorded for them. They identified themselves by mailing
address and current phone number. So we deleted all data we
recorded for that phone number."
> "Did you delete data recorded for their previous phone
number?"
> "Uh, what?"
The stubborn refusal to create a persistent identifier makes
your job harder, not easier.
everforward wrote 21 hours 55 min ago:
> If the only reason you need a surrogate key is to introduce
indirection in your internal database design then sequence
numbers are enough. There is no need to use UUIDs.
The UUID would be an example of an external key (for e.g.
preventing crawling keys being easy). This article mentions a
few reasons why you may later decide there are better external
keys.
> When I come to you and say "My name is X, this is my phone
number, this is my e-mail, I want my GDPR records deleted", you
still need to be able to find all data that is related to me.
How are you going to trace all those records if the requester
has changed their name, phone number and email since they
signed up if you don't have a surrogate key? All 3 of those
are pretty routine to change. I've changed my email and phone
number a few times, and if I got married my name might change
as well.
> Once you start thinking about your database as structured
storage of facts that you can use to infer conclusions, there
is much less need for surrogate keys.
I think that spirals into way more complexity than you're
thinking. You get those timestamped records about "we got info
about person named Y with phone number Z", and then person Y
changes their phone number. Now you're going to start getting
records from person named Y with phone number A, but it's the
same account. You can record "person named Y changed their
phone number from Z to A", and now your queries have to be
temporal (i.e. know when that person had what phone number).
You could back-update all the records to change Z to A, but
that breaks some things (e.g. SMS logs will show that you sent
a text to a number that you didn't send it to).
Worse yet, neither names nor phone numbers uniquely identify a
person, so it's entirely possible to have records saying
"person named Y and phone number Z" that refer to different
people if a phone number transfers from a John Doe to a
different person named John Doe.
I don't doubt you could do it, but I can't imagine it being
worth it. I can't imagine a way to do it that doesn't either
a) break records by backdating information that wasn't true
back then, or b) require repeated/recursive querying that will
hammer the DB (e.g. if someone has had 5 phone numbers, how do
you get all the numbers they've had without pulling the latest
one to find the last change, and then the one before that, and
etc). Those queries are incredibly simple with surrogate keys:
"SELECT * FROM phone_number_changes WHERE user_id = blah".
mkleczek wrote 20 hours 55 min ago:
> The UUID would be an example of an external key (for e.g.
preventing crawling keys being easy). This article mentions a
few reasons why you may later decide there are better
external keys.
So we are talking about "external" keys (ie. visible outside
the database). We are back to square one: externally visible
surrogate keys are problematic because they are detached from
real world information they are supposed to identify and
hence don't really identify anything (see my example about
GDPR).
It does not matter if they are random or not.
> How are you going to trace all those records if the
requester has changed their name, phone number and email
since they signed up if you don't have a surrogate key?
And how does a surrogate key help? I don't know the surrogate
key that identifies my records in your database.
Even if you use them internally, it is an implementation
detail.
If you keep information about the time information was
captured, you can at least ask me "what was your phone number
the last time we interacted, and when was that?"
> I think that spirals into way more complexity than you're
thinking.
This complexity is there whether you want it or not and
you're not going to eliminate it with surrogate keys. It has
to be explicitly taken care of.
DBMSes provide means to tackle this essential complexity:
bi-temporal extensions, views, materialized views etc.
Event sourcing is a somewhat convoluted way to attack this
problem as well.
> Those queries are incredibly simple with surrogate keys:
"SELECT * FROM phone_number_changes WHERE user_id = blah".
Sure, but those queries are useless if you just don't know
user_id.
dpark wrote 18 hours 50 min ago:
> externally visible surrogate keys are problematic because
they are detached from real world information they are
supposed to identify and hence don't really identify
anything (see my example about GDPR).
All IDs are detached from the real world. That's the core
premise of an ID. It's a bit of information that is
unique to someone or something, but it is not that person
or thing.
Your phone number is a random number that the phone company
points to your phone. Your house has a street name and
number that someone decided to assign to it. Your email is
an arbitrary label that is used to route mail to some
server. Your social security number is some arbitrary id
the government assigned you. Even your name is an arbitrary
label that your parents assigned to you.
Fundamentally your notion that there is some "real
world" identifier is not true. No identifiers are real.
They are all abstractions, and the question is not whether
the "real" identifier is better than a "fake" one,
but whether an existing identifier is better than one you
create for your system.
I would argue that in most cases, creating your own ID is
going to save you headaches in the long term. If you bake
SSN or Email or Phone Number throughout your system, you
will make it a pain for yourself when inevitably someone
needs to change their ID and you have cascading updates
needed throughout your entire system.
halffullbrain wrote 19 hours 57 min ago:
In my country, citizens have an "ID" (a UUID, which most
people don't know the value of!) and a social security
number which they know (which has all the problems
described above).
While the social security number may indeed change (doubly
assigned numbers, gender reassignment, etc.), the ID
needn't change, since it's the same physical person.
Public sector it-systems may use the ID and rely on it not
changing.
Private sector it-systems can't look up people by their ID,
but only use the social security number for comparisons and
lookups, e.g. for wiping records in GDPR "right to be
forgotten"-situations. Social security numbers are
sortof-useful for that purpose because they are printed on
passports, driver's licenses and the like. And they are a
problem w.r.t. identity theft, and shouldn't ever be used
as an authenticator (we have better methods for that).
The person ID isn't useful for identity theft, since it's
only used between authorized contexts (disregarding
Byzantine scenarios with rogue public-sector actors!). You
can't social engineer your way to personal data using that
ID (save a few movie-plot scenarios).
So what is internal in this case? The person id is indeed
internal to the public sector's it-systems, and useful for
tracking information between agencies. They're not useful
for Bob or Alice. (They ARE useful for Eve, or other
malicious inside actors, but that's a different story,
which realistically does require a much higher level of
digital maturity across the entire society)
Ukv wrote 1 day ago:
> Stripping information from an identifier disconnects a piece of
data from the real world which means we no longer can match them.
But such connection is the sole purpose of keeping the data in the
first place.
The identifier is still connected to the user's data, just through
the appropriate other fields in the table as opposed to embedded
into the identifier itself.
> So, what happens next is that the real world tries to adjust and
the "data-less" identifier becomes a real world artifact. The
situation becomes the same but worse (eg. you don't exist if you
don't remember your social security id). In extreme cases people
are tattooed with their numbers.
Using a random UUID as primary key does not mean users have to
memorize that UUID. In fact in most cases I don't think there's
much reason for it to even be exposed to the user at all.
You can still look up their data from their current email or phone
number, for instance. Indexes are not limited to the primary key.
> The solution is not to come up with yet another artificial
identifier but to come up with better means of identification
taking into account the fact that things change.
A fully random primary key takes into account that things change -
since it's not embedding any real-world information. That said I
also don't think there's much issue with embedding creation time in
the UUID for performance reasons, as the article is suggesting.
marcus_holmes wrote 11 hours 6 min ago:
> You can still look up their data from their current email or
phone number, for instance. Indexes are not limited to the
primary key.
This is the key point, I think. Searching is not the same as
identifying.
mkleczek wrote 22 hours 59 min ago:
> Using a random UUID as primary key does not mean users have to
memorize that UUID. In fact in most cases I don't think there's
much reason for it to even be exposed to the user at all.
So what is such an identifier for? Is it only for some technical
purposes (like replication etc.)?
Why bother with UUID at all then for internal identifiers?
Sequence number should be enough.
Ukv wrote 22 hours 19 min ago:
UUIDs are good for creating entries concurrently where
coordinating between distributed systems may be difficult.
May also be that you don't want to leak information like how
many orders are being made, as could be inferred from a
`/fetch_order?id=123` API with sequential IDs.
Sequential primary keys are still commonly used though - it's a
scenario-dependent trade-off.
mkleczek wrote 21 hours 57 min ago:
If you expose the identifier outside the database, it is no
longer "internal".
Ukv wrote 21 hours 8 min ago:
Given the chain was:
> > Using a random UUID as primary key does not mean users
have to memorize that UUID. [...]
> So what is such an identifier for? [...] Why bother with
UUID at all then for internal identifiers?
The context, that you're questioning what they're useful
for if not for use by the user, suggests that "internal"
means the complement. That is, IDs used by your company and
software, and maybe even API calls the website makes, but
not anything the user has to know.
Otherwise, if "internal" was intended to mean something
stricter (only used by a single non-distributed database,
not accessed by any applications using the database, and
never will be in the future), then my response is just that
many IDs are neither internal in this sense nor intended to
be memorized/saved by the user.
sethhochberg wrote 22 hours 24 min ago:
"Internal" is a blurry boundary, though - you pick integer
sequence numbers and then, years on, an API gets bolted onto
your purely internal database and now your system is vulnerable
to enumeration attacks. Does a vendor system where you
reference some of your internal data count as "internal"? Is
UID 1 the system user that was originally used to provision the
system? Better try and attack that one specifically... the list
goes on.
UUIDs or other similarly randomized IDs are useful because they
don't include any ordering information or imply anything about
significance, which is a very safe default despite the
performance hits.
There certainly are reasons to avoid them and the article we're
commenting on names some good ones, at scale. But I'd argue
that if you have those problems you likely have the resources
and experience to mitigate the risks, and that true
randomly-derived IDs are a safer default for most new systems
if you don't have one of the very specific reasons to avoid
them.
mkleczek wrote 21 hours 54 min ago:
> "Internal" is a blurry boundary, though
Not for me :)
"Internal" means "not exposed outside the database" (that
includes applications and any other external systems)
demurgos wrote 21 hours 20 min ago:
Internal means "not exposed outside some boundary". For
most people, this boundary encompasses something larger
than a single database, and this boundary can change.
vrighter wrote 1 day ago:
You can't take into account the fact that things change when you
don't know what those changes might be. You might end up needing to
either rebuild a new database, have some painful migration, or
support two codepaths to work with both types of keys.
mkleczek wrote 23 hours 2 min ago:
Network protocol designers know better and by default embed
protocol version number in message format spec.
I guess you can assign 3-4 bits for identifier version number as
well.
And yes - for long living data dealing with compatibility issues
is inevitable so you have to take that into account from the very
beginning.
vrighter wrote 9 hours 48 min ago:
When I designed network protocols this is exactly what I did. I
also did so in file formats I had to create. But a database
primary key is not somewhere where that can be easily done.
groundzeros2015 wrote 1 day ago:
You can't design something by trying to anticipate all future
changes. Things will change and break.
In my personal design sense, I have found that keeping generality
at bay actually helps my code last longer (based on more concrete
ideas) and makes it easier to change when those days come.
dpark wrote 22 hours 33 min ago:
In my experience, virtually every time I bake concrete data
into identifiers I end up regretting it. This isn't a case of
trying to predict all possible future changes. It's a case of
trying to not repeat the exact same mistake again.
groundzeros2015 wrote 19 hours 26 min ago:
I don't disagree with that; I'm disagreeing with this
comment that we can't make protocol or data decisions that
might change.
dpark wrote 19 hours 3 min ago:
I misunderstood then. I interpreted your comment to say
that you eschew generalization (e.g. uuids) in favor of
concrete data (e.g. names, email addresses) for ids in your
designs.
tacone wrote 1 day ago:
Fantastic real-life example. Italian PNs also carry the gender,
which is something you can change surgically, and you'll eventually
run into the issue when operating at scale.
I don't agree with the absolute statement, though. Permanent
identifiers should not generally carry data. There are situations
where you want to have a way to reconcile, or you have space or
speed constraints, so you may accept the trade-off, md5 your data
and store it in a primary index as a UUID. Your index will fragment
and thus you will vacuum, but life will still be good overall.
cozyman wrote 23 hours 33 min ago:
how does one change their gender surgically?
delichon wrote 22 hours 25 min ago:
You can't, but since gender isn't defined by anything physical,
there's no need.
bigstrat2003 wrote 21 hours 21 min ago:
That is only true if you're using an extremely idiosyncratic
definition of gender. As far as 95% of English speakers are
concerned, gender is defined by the body you possess.
defrost wrote 17 hours 18 min ago:
As far as nigh on 100% of Bugis speakers are concerned there
have always been five genders, and they'll tell you the words
in their language they have for them.
* [1] It appears to be a cultural construct.
HTML [1]: https://en.wikipedia.org/wiki/Buginese_language
EnergyAmy wrote 11 hours 33 min ago:
You and the other person are probably talking past each
other. For most people, "gender" is merely the polite way
of saying "sex", and that's probably what the other
commenter was referring to.
Gender in the sense of "the social roles and norms on top
of biological sex" is indeed a construct, though heavily
informed by the biology that they're based on. Biological
sex is very much real and not a construct.
defrost wrote 10 hours 50 min ago:
Of course biological sex is real and strongly bimodal
with outliers, who ever said otherwise?
EnergyAmy wrote 9 hours 53 min ago:
Technically correct, but to be specific sex is binary,
not merely bimodal. Sex is entirely defined by gametes,
and is binary in anisogamous species such as humans.
Isogamous species don't have sexes, they have mating
types (and often many thousands of them).
There's actually an ideological movement to try to
redefine sex based on sex traits instead of gametes,
but this ends up being incoherent and useless for the
field of biology. Biologists have had to publish papers
explaining the fundamentals of their field to counter
the ideological narrative:
Why There Are Exactly Two Sexes. [1] That's why I
thought it was worth mentioning. Many people are
confused because of the culture wars. To bring it back
around to the general topic of this thread, it's fine
to store someone's sex as a boolean, because sex is
binary and immutable. Storing cultural constructs like
gender as anything other than an arbitrary string is
asking for trouble, though.
HTML [1]: https://link.springer.com/article/10.1007/s105...
defrost wrote 3 hours 56 min ago:
Reproductive sex is determined by gametes .. sure.
Not all humans are born with the attribute of
reproductive sex via gametes.
Hence "biological sex is real and strongly bimodal
with outliers" (in humans; it gets odder elsewhere
in animal life on earth). It's just not all
reproductive sex, nor is it all strictly M or
strictly F, despite it mostly being one or the other.
> To bring it back around to the general topic of
this thread, it's fine to store someone's sex as a
boolean, because sex is binary and immutable.
Not in Australia, via a decision that ascended
through all levels of the national court system, nor
is sex, as you've chosen to define it ("entirely
defined by gametes") binary.
Biology is truly messy. It's understandable not
everybody truly grasps this.
Colin Wright is pretty much a prop up cardboard
"scientist" for the Manhattan Institute (a political
conservative think tank).
I tend to run with people with actual field
credentials doing real biology and medicine; Michael
Alpers, Fiona Stanley, Fiona Wood, et al were my
influences.
If Colin Wright scratches your itch for bad biology
then by all means run with the one hit wonder who
reinforces a preconception untroubled by empiricism.
EnergyAmy wrote 1 hour 44 min ago:
You can't legislate reality away. If you're
tracking biological sex, then it doesn't matter
what a court decides. If you're tracking legal
fictions then you might.
I look forward to your citation disputing the truth
of what he lays out in that paper. In the meantime,
feel free to peruse the list here of people
affirming the same stance: [1] Or someone else: [2]
You should ask the people you run with why no human
is born with a body not organized around the
production of gametes. You'll notice that when you
read about conditions like anorchia or ovarian
agenesis, the sex of the person with that condition
is not a mystery, it's literally in the name.
Biology is messy indeed, and that's why finding
such a universal definition was so useful.
HTML [1]: https://projectnettie.wordpress.com/
HTML [2]: https://www.nas.org/academic-questions/33/...
lovich wrote 17 hours 25 min ago:
The only real states of matter are solids, liquids, and
gases. Everything else is just woke lunacy.
I am confident in this fact because I learned it in
elementary school decades ago and it is impossible for
humanity to discover new information that updates our world
model. Every English speaker knows that "plasmas" and
"Bose-Einstein condensates" are made up.
brigandish wrote 14 hours 47 min ago:
We all await your Nobel for finding a third type of gamete.
lovich wrote 13 hours 15 min ago:
The person I was responding to was talking about gender,
but if you want to talk about biology then see [1] [2] [3]
[4] [5] [6] [7]. I assume you will be one of the advocates
for my Nobel Prize.
edit: I'm sorry, you specifically mentioned gametes; we
can talk about diploids and haploids if you wish, and how
our bodies are such complicated machines that any sort of
error that can occur in our growth is guaranteed to occur
at scale
HTML [1]: https://en.wikipedia.org/wiki/Intersex#Prevalenc...
HTML [2]: https://en.wikipedia.org/wiki/Klinefelter_syndro...
HTML [3]: https://en.wikipedia.org/wiki/XXYY_syndrome
HTML [4]: https://en.wikipedia.org/wiki/XXXY_syndrome
HTML [5]: https://en.wikipedia.org/wiki/XXXYY_syndrome
HTML [6]: https://en.wikipedia.org/wiki/XXXXY_syndrome
HTML [7]: https://en.wikipedia.org/wiki/Trisomy_X
brigandish wrote 10 hours 41 min ago:
Not only have you undermined your claim to a Nobel
award by showing a spurious understanding of biology,
you wrote, quite sarcastically "it is impossible for
humanity to discover new information that updates our
world model". Well then, we will all await your
discovery of that 3rd gamete, or some theory so
innovative that it tips this well studied, well
understood, uncontested (by any valid competitor) model
to the wayside and humanity can revel in this new
information, the better model of reality that you
promise.
While you're at it, you could tell us all what the
scientific discovery was that made gender separate from
sex, who found it and when, and what the defining
difference is. Did they win a Nobel for that?
I request that in any reply, you refrain from spamming
me with Wikipedia links to articles you don't
understand and probably haven't read.
EnergyAmy wrote 11 hours 54 min ago:
XXY/etc are all variations within a sex. The above
poster is correct to point out that sex is defined
entirely by the gamete size that one's body is
organized around producing in anisogamous species like
humans, and is binary.
Intersex is a misleading term, the better term is [1] .
There are male DSDs and female DSDs. Even in the case
of ovotestes, you'll have one gamete produced, and the
other tissue will be nonfunctional.
HTML [1]: https://en.wikipedia.org/wiki/Disorders_of_sex...
lovich wrote 11 hours 23 min ago:
And yet, the original person I was responding to
spoke about gender.
If you are going to step into this argument, please
do not move the goalposts
edit: I've triggered the HN censor bot, so editing to
apologize to EnergyAmy, they are correct on their
point. I am still going to throw back at brigandish
that they moved the goalposts
EnergyAmy wrote 11 hours 18 min ago:
I'm responding specifically to your comment in
regards to "but if you want to talk about biology
then" followed by a list of biological variations
that don't dispute the sex binary. The goalposts
are exactly where you've left them.
Terr_ wrote 21 hours 0 min ago:
Does that mean hundreds of years of English-speakers
referring to sailing ships as "she" were all part of a
conspiracy to hide that ships have jiggly bits? :p
pezezin wrote 4 hours 19 min ago:
Wait until you find gendered languages (like most languages
in Europe) and realize that grammatical gender usually
doesn't have anything to do with biological sex :P
WalterSlovotsky wrote 22 hours 48 min ago:
The preferred method would be gender affirming surgery.
cozyman wrote 22 hours 19 min ago:
affirming would mean the change has already taken place
BobaFloutist wrote 19 hours 47 min ago:
Right, because it has. The change in gender identity (or in
choosing to make said identity more public) has already taken
place, and the surgery seems to affirm that.
mckirk wrote 1 day ago:
I'm not sure whether that was intended, but 'operating at scale'
actually made me laugh out loud :D
benterix wrote 22 hours 34 min ago:
I have to admit an unintended chuckle, too.
oncallthrow wrote 1 day ago:
It sounds to me like you're just arguing for premature optimization
of another kind (specifically, prematurely changing your entire
architecture for edge cases that probably won't ever happen to
you).
vintermann wrote 1 day ago:
If you have an architecture already, obviously it's hard to change
and you may want to postpone it until those edge cases which
probably won't ever happen to you, happen. But for new
architectures, value your own grey hairs over small performance
improvements.
raxxorraxor wrote 1 day ago:
> Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities
The issue is that this is true for more or less all capability URLs. I
wouldn't recommend UUIDs per se here; it's probably better to just use a
random number. I have seen UUIDs used for this in practice, though, and
those systems weren't compromised because of it.
I hate the tendency of password recovery flows, for example, to leave
the URL valid for only 5 minutes. Of course these URLs need to have a
limited lifetime, but mail isn't a real-time communication medium. There
is very little security benefit in reducing it from 30 minutes to 5
minutes. You are not getting "securer" this way.
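For illustration, a plain random token with a saner expiry might look
like this (a minimal sketch; assumes the pgcrypto extension, and the
table and column names are made up):
  CREATE EXTENSION IF NOT EXISTS pgcrypto;
  -- 32 random bytes, hex-encoded: 256 bits of entropy, nothing to enumerate
  CREATE TABLE password_resets (
      token      text PRIMARY KEY DEFAULT encode(gen_random_bytes(32), 'hex'),
      user_id    bigint NOT NULL,
      expires_at timestamptz NOT NULL DEFAULT now() + interval '30 minutes'
  );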
K0nserv wrote 1 day ago:
An additional thing I learned when I worked on a ulid alternative over
the weekend[0]: Postgres's internal Datum type is at most 64 bits,
which means every uuid requires heap allocation[1] (at least until we
get 128-bit machines).
0: [1] 1: [2]
HTML [1]: https://bsky.app/profile/hugotunius.se/post/3m7wvfokrus2g
HTML [2]: https://github.com/postgres/postgres/blob/master/src/backend/u...
qntmfred wrote 20 hours 1 min ago:
you may be interested in this postgres extension as well
HTML [1]: https://github.com/blitss/typeid-postgres
K0nserv wrote 15 hours 47 min ago:
That is indeed interesting.
I have slightly different goals for my version. I want everything
to fit in 128 bits so I'm sacrificing some of the random bits, I'm
also making sure the representation inside Postgres is also exactly
128 bits. My initial version ended up using CBOR encoding and being
160 bits.
Mine dedicates 16 bits for the prefix allowing up to 3 characters
(a-z alphabet).
reactordev wrote 1 day ago:
A fun trick I did was generating UUID-like ids. We can all identify a
UUIDv4 most of the time by looking at one. "Ah, a uuid," we say to
ourselves. A little over a decade ago I was working on a massive cloud
platform, and rather than generate string keys like the author above
suggested (int -> binary -> base62 str), we opted for a more "clever"
approach.
The UUID is 128 bits. The first 64 bits are a java long. The last 64 bits
are another java long. Let's just combine the Tenant ID long with a
Resource ID long to generate a unique id for this on our platform.
(Worked until it didn't.)
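In Postgres terms the scheme looked roughly like this (a reconstruction
of the idea, not the actual code; the two values are made up):
  -- pack two 64-bit values into one uuid-shaped 128-bit id
  SELECT encode(int8send(1001) || int8send(42), 'hex')::uuid;
  -- tenant id = 1001, resource id = 42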
nrhrjrjrjtntbt wrote 1 day ago:
Atlassian settles on longer "ARIs" for this (e.g. [1]), composed of
guids, which allow a scheme like the Amazon ARN to be passed around.
HTML [1]: https://developer.atlassian.com/cloud/guard-detect/developer...
reactordev wrote 1 day ago:
yeah, the problem for us was the resource id. What id was it? Was
it a post? an upload? a workspace? it wasn't nearly as descriptive
as we needed it to be.
cebert wrote 1 day ago:
Using UUIDs as primary keys in non-relational databases like DynamoDB
is valid and doesn't raise the concerns mentioned in the article.
andatki wrote 1 day ago:
Good point; the post should make clear that it's referring only
to my experience with Postgres.
bux93 wrote 1 day ago:
Long article about why not to use UUIDv4 as primary keys, but... who is
doing so? And why are they doing that? How would you solve their
requirements? Just throwing out "you can use UUIDv7" doesn't help with,
e.g., the size they take up.
Aren't people using (big)ints as primary keys, and using UUIDs as
logical keys for import/export, solving portability across different
machines?
andatki wrote 15 hours 23 min ago:
This was written based on working as a consultant on several Postgres
databases at "medium"-sized companies that had excessive IO and
latency and used UUID v4 PKs/FKs. They're definitely out there. We
could transform the schema for some key tables as a demonstration with
big int equivalents and show the IO latency reduction. With that said,
a real-world PK data type migration is costly, so whether to do it
becomes a business decision.
Sayrus wrote 1 day ago:
UUIDs are usually the go-to solution to enumeration problems. The
space is large enough that an attacker cannot guess how many X you
have (invoices, users, accounts, organizations, ...). When people
replace the ints with UUIDv4s, they keep them as primary keys.
bruce511 wrote 1 day ago:
I'd add that it's also used when data is created in multiple
places.
Consider say weather hardware. 5 stations all feeding into a
central database. They're all creating rows and uploading them.
Using sequential integers for that is unnecessarily complex (if
even possible).
Given the amount of data created on phones and tablets, this
affects more situations than first assumed.
It's also very helpful in export / edit / update situations. If I
export a subset of the data (let's say to Excel), the user can edit
all the other columns and I can safely import the result. With
integer they might change the ID field (which would be bad). With
uuid they can change it, but I can ignore that row (or the whole
file) because what they changed it to will be invalid.
nrhrjrjrjtntbt wrote 1 day ago:
Yes, and the DB might be columnar or a distributed KV,
sidestepping the index problem.
dfox wrote 1 day ago:
> Creating obfuscated values using integers
While that is often a neat solution, do not do it by simply XORing the
numbers with a constant. Use a block cipher in ECB mode (if you want the
ID to be short, something like NSA's Speck comes in handy here, as it
can be instantiated with a 32- or 48-bit block).
And do not even think about using RC4 for that (I've seen it multiple
times), because that is completely equivalent to XORing with a constant.
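A rough sketch of the idea using pgcrypto, which ships AES but not
Speck, so the obfuscated value grows to 128 bits (key and id are made
up):
  CREATE EXTENSION IF NOT EXISTS pgcrypto;
  -- ECB over a single block is just a keyed permutation of the id space
  SELECT encode(encrypt(int8send(42), 'sixteen byte key', 'aes-ecb'), 'base64');
  -- decrypt() with the same key and mode inverts the mapping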
socketcluster wrote 1 day ago:
My advice is: Avoid Blanket Statements About Any Technology.
I'm tired of midwit arguments like "Tech X is N% faster than tech Y at
performing operation Z. Since your system (sometimes) performs
operation Z, it implies that Tech X is the only logical choice in all
situations!"
It's an infuriatingly silly argument because operation Z may only
represent about 10% of the total CPU usage of the whole system
(averaged out)... So what is promoted as a 50% gain may in fact be a 5%
gain when you consider it in the grand scheme of things... Negligible.
If everyone was looking at this performance 'advantage' rationally,
nobody would think it's worth sacrificing important security or
operational properties.
I don't know what happened to our industry; we're supposed to be
intelligent people but I see developers falling for these obvious
logical fallacies over and over.
I remember back in my day, one of the senior engineers was discussing
upgrading a python system and stated openly that the new version of the
engine was something like 40% slower than the old version but he didn't
even have to explain himself why upgrading was still a good decision;
everybody in the company knew he was only talking about the code
execution speed and everybody knew that this was a small fraction of
the total.
Not saying UUIDv7 was a bad choice for Postgres. I'm sure it's fine for
a lot of situations but you don't have to start a cult preaching the
gospel of The One True UUID to justify your favorite project's
decisions.
I do find it kind of sly though how the community decided to make this
UUIDv7 instead of creating a new standard for it.
The whole point of UUID was to leverage the properties of randomness to
generate unique IDs without requiring coordination. UUIDv7 seems to
take things down a philosophically different path. People chose UUID for
scalability and simplicity (both of which you get as a result of doing
away with the coordination overhead), not for raw performance...
That's the other thing which drives me nuts; people who don't
understand the difference between performance and scalability. People
foolishly equate scalability with parallelism or concurrency; whereas
that's just one aspect of it; scalability is a much broader topic. It's
the difference between a theoretical system which is fast given a
certain artificially small input size and one which actually performs
better as the input size grows.
Lastly, no mention is made of the complex logic that has to take
place behind the scenes to generate UUIDv7 IDs... People take it for
granted that all computers have a clock that can produce accurate
timestamps and that all computers in the world are magically in sync...
UUIDv7 is not simple; it's very complicated. It has a lot of additional
complexity and dependencies compared to UUIDv4. Just because that
complexity is very well hidden from most developers, doesn't mean it's
not there and that it's not a dependency... This may become especially
obvious as we move to a world of robotics and embedded systems where
cheap microchips may not have enough Flash memory to hold the code for
the kinds of programs required to compute such elaborate IDs.
thraxil wrote 1 day ago:
Yep. We have tables that use UUIDv4 that have 60M+ rows and don't
have any performance problems with them. Would some queries be faster
using something else? Probably, but again, for us it's not close to
being a bottleneck. If it becomes a problem at 600M or 6B rows, we'll
deal with it then. We'll probably switch to UUIDv7 at some point, but
it's not a priority and we'll do some tests on our data first. Does
my experience mean you should use UUIDv4? No. Understand your own
system and evaluate how the tradeoffs apply to you.
ezekg wrote 21 hours 48 min ago:
I have tables that have billions of rows that use UUIDv4 primary
keys and I haven't encountered any issues either. I do use UUIDv7
for write-heavy tables, but even then, I got a way bigger
performance boost from batching inserts than switching from UUIDv4
to UUIDv7. Issue is way overblown.
p2detar wrote 23 hours 7 min ago:
Nice feedback. Out of curiosity, have you done any fine-tuning of
Postgres that greatly improved performance?
thraxil wrote 6 hours 15 min ago:
Nope. Out of the box GCP Cloud SQL instance.
kunley wrote 1 day ago:
Wasn't choosing uuids as ids falling for the deceptive argument in
the first place?
christophilus wrote 1 day ago:
Not really, no. They're very convenient for certain problems and
work really well in general. I've never had a performance issue
where the problem boiled down to my use of UUID.
danparsonson wrote 1 day ago:
You never having seen the problem doesn't mean it never happens;
I have dealt with a serious performance problem in the past that
was caused by excessive page fragmentation from a GUID PK.
To your original point, these are heuristics; there isn't always
time to dig into every little architectural decision, so having a
set of rules of thumb on hand helps to preempt problems at
minimal cognitive cost. "Avoid using a GUID as a primary key if
you can" is one of mine.
kunley wrote 1 day ago:
What are these certain problems, if I may ask?
socketcluster wrote 1 day ago:
A major one for me is preventing duplicate records.
If the client POSTs a new object to insert into the
database, and there is a connection failure such that the client
does not receive a success response from the server, the client
cannot know whether the record was inserted or not without
making an expensive and cumbersome additional read call to
check... The client cannot simply assume that the insertion did
not happen purely on the basis that they did not receive a
success response. It could very well be that the insertion
succeeded but the connection failed shortly after, so the
response was not received. If the IDs are auto-incremented on the server
and the client posts the same object again without any ID on
it, the server will create a duplicate record in the database
table (same object with a different ID).
On the other hand, if the client generates a UUID for the
object it wants to create on the front-end, then it can safely
resend that exact object any number of times and there is no
risk of double-insertion; the object will be rejected the
second time and you can show the user a meaningful error
"Record was already created" instead of creating two of the
same resource; leading to potential bugs and confusion.
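A minimal sketch of that flow on the Postgres side (hypothetical
table; the uuid is any client-generated v4 value):
  CREATE TABLE orders (
      id      uuid PRIMARY KEY,  -- generated by the client, not the server
      payload jsonb NOT NULL
  );
  -- safe to replay after a lost response: the second attempt is a no-op
  INSERT INTO orders (id, payload)
  VALUES ('2f9a2a3e-5c3f-4d6e-9b1a-8a2d3c4e5f60', '{"item": 1}')
  ON CONFLICT (id) DO NOTHING;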
kunley wrote 1 day ago:
Ehm.. so you're saying that INSERT ... RETURNING id is not
atomic from the client's pov because something terrible could
happen just when client is receiving the answer inside its
SQL driver?
socketcluster wrote 1 day ago:
I'm actually more thinking about the client sitting on the
front-end like a single page app. Network instability could
cause the response to not reach the front-end after a
successful insert. This wouldn't be extremely common but
would definitely be a problem for you as the database admin
if you have above a certain number of users. I've seen this
issue on live production systems and the root cause of
duplicate records can be baffling because of how
infrequently it may happen. Tends to cause issues that are
hard to debug.
kunley wrote 21 hours 36 min ago:
OK got it. I was thinking about SQL client, not about
client of a REST service. With that distinction in mind,
the reasoning makes sense; thank you.
mkleczek wrote 1 day ago:
Preferably, you would design your APIs and services to be
idempotent (i.e. use PUT not POST, etc.).
Using an idempotency identifier is a last resort in my book.
dpark wrote 19 hours 59 min ago:
How is URI not an idempotency identifier for PUT?
socketcluster wrote 1 day ago:
Still, UUID is probably the simplest and most reliable way
to generate such idempotency identifiers.
scary-size wrote 1 day ago:
This reminds me of this old gist for generating Firebase-like "push
IDs" [1]. Those have some nicer properties.
HTML [1]: https://gist.github.com/mikelehen/3596a30bd69384624c11
Lucasoato wrote 1 day ago:
Hi, a question for you folks. What if I don't want to embed a timestamp
in the uuid, as v7 does? That could expose you to timing attacks in
specific scenarios.
Also, is it necessary to show uuids at all to customers of an API? Or
could it be a valid pattern to hide all the querying complexity behind
named identifiers, even if it could cost a bit in terms of joining and
indexing?
The context is the classic B2B SaaS, but feel free to share your
experiences even if they come from other scenarios!
lwhi wrote 1 day ago:
Wouldn't you need to expose UUID if you want to make use of
optimistic locking?
Lucasoato wrote 1 day ago:
I feel that this is among the good reasons to keep exposing UUID in
the API.
dimitrisnl wrote 1 day ago:
Noob question, but why not use ints for the PK, and UUIDs for a
public_id field?
scotty79 wrote 17 hours 19 min ago:
One of the benefits of UUIDs is that you can easily merge data coming
from multiple databases. Auto-increments cause collisions.
dsego wrote 1 day ago:
I also think we can use a combination of a PID - persistent ID (I
always thought it was public) and an auto-increment integer ID.
Having a unique key helps when migrating data between systems or
referencing a piece of data in a different system. Also, using serial
IDs in URLs and APIs can reveal sensitive information, e.g. how many
items there are in the database.
alerighi wrote 1 day ago:
If you put an index on the UUID field (because you have an API where
you can retrieve objects by UUID) you have kind of the same problem,
at least in Postgres, where a primary key index and a secondary index
are more or less the same (to the point that it's perfectly valid in
pgsql to not define any primary key for a table, because storage on
disk is keyed by an internal ID, and the indexes, primary or not, just
reference that row ID). Plus there's the wasted space of having 2
indexes for the same table.
Of course this is not always bad: for example, if you have a lot of
relations, you can have only one table with the UUID field (and thus
the expensive index), and the relations can use the more efficient int
key (say, a user entity with both int and uuid keys, where a user
attribute references the user by the int key, at the expense of a join
if you need to retrieve one user attribute without retrieving the
user).
torginus wrote 1 day ago:
You can create hash indexes in Postgres, so a secondary uuid index
seems workable:
HTML [1]: https://www.postgresql.org/docs/current/hash-index.html
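For example (hypothetical table; note that hash indexes only support
equality lookups and can't back a UNIQUE constraint):
  CREATE TABLE users (
      id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      public_id uuid NOT NULL DEFAULT gen_random_uuid()
  );
  CREATE INDEX users_public_id_idx ON users USING hash (public_id);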
grim_io wrote 1 day ago:
The article mentions microservices, which can increase the likelihood
of collisions in sequential incremental keys.
One more reason to stay away from microservices, if possible.
mrkeen wrote 1 day ago:
The 'collision' is two service classes both trying to use one db.
If you separate them (i.e. microservices) then they no longer try to
use one db.
grim_io wrote 1 day ago:
There is nothing stopping multiple microservices from using the
same DB, so of course this will happen in practice.
Sometimes it might even be for a good reason.
bardsore wrote 1 day ago:
Always try to avoid having two services using the same DB. Only way
I'd ever consider sharing a DB is if only one service will ever
modify it and all others only read.
grim_io wrote 1 day ago:
Good luck enforcing that :)
edding4500 wrote 1 day ago:
*edit: sorry, misread that. My answer is not valid for your question.
original answer: because if you don't come up with these ints randomly,
they are sequential, which can cause many unwanted situations where
people can guess valid IDs and deduce things from that data. See
HTML [1]: https://en.wikipedia.org/wiki/German_tank_problem
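For reference, the standard minimum-variance unbiased estimator from
that article, with k observed serial ids and m the largest one seen:
  \hat{N} = m\left(1 + \frac{1}{k}\right) - 1
so even a handful of leaked sequential ids yields a good estimate of
the total count.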
javaunsafe2019 wrote 1 day ago:
So we make things hard in the backend because of leaky
abstractions? Doesn't make sense imo.
jcims wrote 1 day ago:
Decades of security vulnerabilities and compromises because of
sequential/guessable PKs is (only!) part of the reason we're
here. Miss an authorization check anywhere in the application
and you're spoon-feeding entire tables to anyone with the
inclination to ask for it.
javawizard wrote 1 day ago:
Hence the presumed implication behind the public_id field in GP's
comment: anywhere identifiers are exposed, you use the public_id
field, thereby preventing ID guessing while still retaining the
benefits of ordered IDs where internal lookups are concerned.
Edit: just saw your edit, sounds like we're on the same page!
mcny wrote 1 day ago:
PostgreSQL 18 was released in September and has uuidv7
HTML [1]: https://www.postgresql.org/docs/current/functions-uuid.html
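A minimal sketch on 18 (table name made up):
  CREATE TABLE events (
      id uuid PRIMARY KEY DEFAULT uuidv7()
  );
  -- note the generation time is recoverable from the value
  SELECT uuid_extract_timestamp(uuidv7());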
waynenilsen wrote 1 day ago:
What kills me is I can't double click the thing to select it.
mrits wrote 1 day ago:
This is application specific. iTerm2 doesn't break selection on the
'-', while Firefox does.
dotancohen wrote 1 day ago:
From the fine article:
> Random values don't have natural sorting like integers or
lexicographic (dictionary) sorting like character strings. UUID v4s do
have "byte ordering," but this has no useful meaning for how they're
accessed.
Might the author mean that random values are not sequential, so
ordering them is inefficient? Of course random values can be ordered -
and ordering by what he calls "byte ordering" is exactly how all
integer ordering is done. And naive string ordering too, like we would
do in the days before Unicode.
andatki wrote 19 hours 31 min ago:
Hi there. Thanks for the feedback. I updated that section to
hopefully convey the intent more. The type of ordering we care about
for this topic is really B-Tree index traversal when inserting new
entries and finding existing entries (single and multiple values i.e.
an IN clause, updates, deletes etc). There's a compelling example I
re-created from Cybertec showing the pages needed and accessed for
equivalent user-facing results, comparing storing PKs as big integers
vs. UUID v4s, and how many more pages were needed for v4 UUIDs. I
found that to be helpful to support my real world experience as a
consultant on various "medium sized" Postgres databases (e.g. single
to 10s of millions of records) where clients were experiencing
excessive latency for queries, and the UUID v4 PK/FK selection made
for the reasons mentioned earlier was one of the main culprits. The
indexes wouldn't fit into memory, resulting in a lot of sequential
scans. I'd confirm this by showing an alternative schema design and
set of queries where everything was the same except integer PKs/FKs
were used. Smaller indexes (fit in memory), reliable index scans, less
latency, faster execution time.
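If you want to reproduce the effect yourself, a quick sketch (made-up
tables; exact sizes will vary by version and fillfactor):
  CREATE TABLE t_int  (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, filler text);
  CREATE TABLE t_uuid (id uuid DEFAULT gen_random_uuid() PRIMARY KEY, filler text);
  INSERT INTO t_int  (filler) SELECT 'x' FROM generate_series(1, 1000000);
  INSERT INTO t_uuid (filler) SELECT 'x' FROM generate_series(1, 1000000);
  -- compare the two primary key indexes
  SELECT pg_size_pretty(pg_relation_size('t_int_pkey')) AS int_pk,
         pg_size_pretty(pg_relation_size('t_uuid_pkey')) AS uuid_pk;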
hashmush wrote 22 hours 58 min ago:
Agree, I did a double take on this too.
Values of the same type can be sorted if an order is defined on the
type.
It's also strange to contrast "random values" with "integers". You
can generate random integers, and they do have an ordering (depending
on what that means, though).
dev_l1x_be wrote 1 day ago:
Why would you need to order by UUID? I am missing something here.
Most of the time we use UUID keys for being able to create a new key
without coordination and most of the time we do not want to order by
primary key.
sagarm wrote 1 day ago:
Most common database indexes are ordered, so if you are using
UUIDv4 you will not only bloat the index, you will also have poor
locality. If you try to use composite keys to fix locality, you'll
end up with an even more bloated index.
hans_castorp wrote 1 day ago:
I have seen a lot of people sort by (generated) integer values to
return the rows "in creation order" assuming that sorting by an
integer is somehow magically faster than sorting by a proper
timestamp value (which gives a more robust "creation order" sorting
than a generated integer value).
sgarland wrote 1 day ago:
Assuming the integer value is the PK, it can in fact be much
faster for MySQL / MariaDB due to InnoDB's clustered index. If
it can do a range scan over the PK, and that's also the ORDER
BY (with matching direction), congratulations, the rows are
already ordered, no sort required. If it has to do a secondary
index lookup to find the rows, this is not guaranteed.
K0nserv wrote 1 day ago:
Isn't part of this that inserting into a btree index is more
performant when the keys are increasing rather than being random? A
random id will cause more re-balancing operations than always
inserting at the end. Increasing ids are also more cache friendly
sgarland wrote 1 day ago:
Yes, and for Postgres, it also causes WAL bloat due to the high
likelihood of full page writes.
dotancohen wrote 5 hours 8 min ago:
Could you expand on this? I use postgres often and though I could
have an LLM explain what you mean, I think I'd learn more to hear
it from you. Thank you.
crest wrote 1 day ago:
Any fixed sized bitstring has an obvious natural ordering, but since
they're allocated randomly they lack the density and locality of
sequential allocation.
torginus wrote 1 day ago:
To be polite, I don't think this article rests on sound technical
foundations.
mrinterweb wrote 17 hours 29 min ago:
How so?
dagss wrote 1 day ago:
The point is how closely located data you access often is. If data is
roughly sorted by creation time then data you access close to one
another in time is stored close to one another on disk. And typically
access to data is correlated with creation time. Not for all tables
but for many.
Accessing data in totally random locations can be a performance
issue.
Depends on lots of things ofc but this is the concern when people
talk about UUID for primary keys being an issue.
kreetx wrote 1 day ago:
Using a UUIDv4 as primary key is a trade-off: you use it when you
need to generate unique keys in a distributed manner. Yes, these are
not datetime ordered and yes, they take 128 bits of space. If you
can't live with this, then sure, you need to consider alternatives. I
wonder if "Avoid UUIDv4 Primary Keys" is a rule of thumb though.
torginus wrote 1 day ago:
I do not understand why 128 bits is considered too big - you
clearly can't have less, as on 64 bits the collision probability on
real world workloads is just too high, for all but the smallest
databases.
Auto-incrementing keys can work, but what happens when you run out
of integers? Also, distributed dbs probably make this hard, and
they can't generate a key on the client.
There must be something in Postgres that wants to store the records
in PK order, which, while it could be an okay default, I'm pretty
sure you can change, as it isn't great for write-heavy workloads.
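To put rough numbers on that, the birthday bound for n random d-bit
ids is
  p \approx \frac{n^2}{2^{d+1}}
so a billion random 64-bit ids already collide with probability around
3%, while the 122 random bits of a UUIDv4 keep that negligible at any
realistic n.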
p1necone wrote 20 hours 36 min ago:
I doubt many real world use cases would run out of incrementing
64 bit ids - collisions if they were random sure, but i64 max is
9,223,372,036,854,775,807 - if each row took only 1 bit of space,
that would be slightly more than an exabyte of data.
vbezhenar wrote 23 hours 19 min ago:
You won't run out of a 64-bit integer. IMO, a 64-bit integer (or
even less for some tables that aren't expected to grow much) is
the best approach for an internal database ID. If you want to expose
an ID, it might make sense to introduce a second UUID for selected
tables, if you want to hide the internal ID.
anarazel wrote 1 day ago:
The issue is more fundamental - if you have purely random keys,
there's basically no spatial locality for the index data. Which
means that for decent performance your entire index needs to be
in memory, rather than just recent data. And it means that you
have much bigger write amplification, since it's rare that the
same index page is modified multiple times close-enough in time
to avoid a second write.
dotancohen wrote 1 day ago:
If one needs timestamp ordering, then UUIDv7 is a good alternative.
But the author does not say timestamp ordering, he says ordering. I
think he actually means and believes that there is some problem
ordering UUIDv4.
kreetx wrote 1 day ago:
Yup. There are alternatives depending on what the situation is:
with non-distributed, you could just use a sufficiently sized int
(which can be rather small when the table is for e.g humans). You
could add a separate timestamp column if that is important.
But if you need UUID-based lookup, then you might as well have it
as a primary key, as that will save you an extra index on the
actual primary key. If you also need a date and the remaining
bits in UUIDv7 suffice for randomness, then that is a good option
too (though this does essentially amount to having a composite
column made up of datetime and randomness).
xandrius wrote 1 day ago:
To summarise the article: in PG, prefer using UUIDv7 over UUIDv4 as
they have slightly better performance.
If you're using the latest version of PG, there is a plugin for it.
That's it.
hans_castorp wrote 1 day ago:
With the latest Postgres version (>= 18) you do NOT need a plugin
sbuttgereit wrote 1 day ago:
You might have missed the big H2 section in the article:
"Recommendation: Stick with sequences, integers, and big integers"
After that then, yes, UUIDv7 over UUIDv4.
This article is a little older. PostgreSQL didn't have native
support so, yeah, you needed an extension. Today, PostgreSQL 18 is
released with UUIDv7 support... so the extension isn't necessary,
though the extension does make the claim:
"[!NOTE] As of Postgres 18, there is a built in uuidv7() function,
however it does not include all of the functionality below."
What those features are and if this extension adds more cruft in
PostgreSQL 18 than value, I can't tell. But I expect that the vast
majority of users just won't need it any more.
tmountain wrote 1 day ago:
Sticking with sequences and other integer types will cause problems
if you need to shard later.
SoftTalker wrote 22 hours 16 min ago:
This is mentioned, and in many applications you can safely say
you will never need to shard.
tmountain wrote 2 hours 21 min ago:
Yes, but if you do need to, it's much simpler if you've been using
UUIDs since the beginning. I'm personally not convinced that any
of the tradeoffs that come with a more traditional key are
worth the headache that could come in a scenario where you do
need to shard. I started a company last year, and the DB has
grown wildly beyond our expectations. I did not expect this,
and it continues to grow (good problem to have). It happens!
bigmadshoe wrote 22 hours 28 min ago:
I'm really no expert on sharding, but if you're using
increasing ints why can't you just shard on (id % n) or
something?
0x457 wrote 21 hours 8 min ago:
Because then you run into an issue when your 'n' changes. Plus,
where are you incrementing it? That requires a single
fault-tolerant ticker (some do that, btw).
Once you encode the shard number into the ID, you get:
- instantly* knowing which shard to query
- each shard having its own ticker
* programmatically, maybe visually as well depending on
implementation
I had IDs that encoded: entity type (IIRC 4 bits?), timestamp,
shard, sequence per shard. We even had an admin page where you
could paste an ID and it would decode it.
id % n is fine for a cache, because you can just throw the whole
thing away and repopulate, or when 'n' never changes, but it
usually does.
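A toy version of that layout, with hypothetical bit widths (41-bit
millisecond timestamp, 10-bit shard, 12-bit per-shard sequence):
  SELECT ((extract(epoch FROM now()) * 1000)::bigint << 22)
       | (5 << 12)   -- shard number
       | 42;         -- per-shard sequence counter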
tmountain wrote 2 hours 23 min ago:
^ This
sgarland wrote 1 day ago:
There are plenty of ways to deal with that. You can shard by some
other identifier (though I then question your table design), you
can assign ranges to each shard, etc.
zwnow wrote 1 day ago:
Especially in larger systems, how does one solve the issue of
reaching the max value of an integer in their database? Sure, for
an unsigned bigint that's hard to achieve, but regular ints? Apps
quickly outgrow those.
sbuttgereit wrote 1 day ago:
OK... but that concern seems a bit artificial. If bigints are
appropriate: use them. If the table won't get to bigint sizes:
don't. I've even used smallint for some tables I knew were
going to be very limited in size. But I wouldn't worry about
smallint's very limited number of values for those tables that
required a larger size for more records: I'd just use int or
bigint for those other tables as appropriate. The reality is
that, unless I'm doing something very specific where being
worried about the number of bytes will matter... I just use
bigint. Yes, I'm probably being wasteful, but in the cases
where those several extra bytes per record are going to really
add up.... I probably need bigint anyway and in cases where
bigint isn't going to matter the extra bytes are relatively
small in aggregate. The consistency of simply using one type
itself has value.
And for those using ints as keys... you'd be surprised how many
databases in the wild won't come close to consuming that many
IDs or are for workloads where that sort of volume isn't even
aspirational.
Now, to be fair, I'm usually in the UUID camp and am using
UUIDv7 in my current designs. I think the parent article makes
good points, but I'm after a different set of trade-offs where
UUIDs are worth their overhead. Your mileage and use-cases may
vary.
zwnow wrote 1 day ago:
Idk, I use whatever scales best, and that would be a close-to-
infinite scaling key. The performance compromise is probably
zeroed out once you have to adapt your database to a different
one supporting the current scale of the product. That's for
software that has to scale. Whole different story for stuff
that doesn't have to grow, obviously. I am in the UUID camp too,
but I don't care whether it's v4 or v7.
wongarsu wrote 1 day ago:
It's not like there are dozens of options and you
constantly have to switch. You just have to estimate if at
maximum growth your table will have 32 thousand, 2 billion
or 9 quintillion entries. And even if you go with 9
quintillion for all cases you still use half the space of a
UUID
UUIDv4 are great for when you add sharding, and UUIDs in
general prevent issues with mixing ids from different
tables. But if you reach the kind of scale where you have 2
billion of anything UUIDs are probably not the best choice
either
benterix wrote 1 day ago:
The article sums up some valid arguments against UUIDv4 as PKs but the
solution the author provides on how to obfuscate integers is probably
not something I'd use in production. UUIDv7 still seems like a
reasonable compromise for small-to-medium databases.
stickfigure wrote 22 hours 31 min ago:
In Postgres I often like to use a single sequence for everything. It
leaks some information yes but in a busy system it tends to be
"obscure enough".
x0x0 wrote 20 hours 53 min ago:
It's not the leaking that's the concern. It's that making the
names of objects hard to enumerate is a strongly
security-enhancing feature of a system.
Yes of course everyone should check and unit test that every object
is owned by the user or account loading it, but demanding more
sophistication from an attacker than taking "/my_things/23" and
loading "/my_things/24" is a big win.
stickfigure wrote 18 hours 2 min ago:
With a single sequence and a busy system, the ids for most
high-level tables/collections are extremely sparse. This doesn't
mean they can't be enumerated, but you will probably notice if
you suddenly start getting hammered with 404s or 410s or whatever
your system generates on "not found".
Also, if most of your endpoints require auth, this is not
typically a problem.
It really depends on your application. But yes, that's something
to be aware of. If you need some ids to be unguessable, make sure
they are not predictable :-)
thayne wrote 12 hours 25 min ago:
If you have a busy system, a single sequence is going to be a
pretty big performance bottleneck, since every resource
creation will need to acquire a lock on that sequence.
x0x0 wrote 16 hours 35 min ago:
> Also, if most of your endpoints require auth, this is not
typically a problem.
Many systems are not sparse, and separately, that's simply
wrong. Unguessable names are not a primary security measure
but a passive remediation for bugs or bad code. Broken access
control remains an OWASP top 10, and IDOR is a piece of that.
Companies still get popped for this.
See, eg, google having a bug in 2019, made significantly less
impactful by unguessable names
HTML [1]: https://infosecwriteups.com/google-did-an-oopsie-a-sim...
formerly_proven wrote 1 day ago:
If all you want is to obfuscate the fact that your social media site
only has 200 users and 80 posts, simply use a permutation over the
autoincrement primary key. E.g. IDEA or CAST-128, then encode in
base64. If someone steps on your toes because somewhere in your
codebase you're using a forbidden legacy cipher, just use AES-128.
(This is sort of the degenerate/tautological base case of
format-preserving encryption)
(What do you think Youtube video IDs are?)
conradfr wrote 1 day ago:
Can't you just change the starting value of your sequence?
enz wrote 1 day ago:
The problem with this approach is that you now have to manage a
secret key for (maybe) a very long time.
I shared this article a few weeks ago, discussing the problems with
this kind of approach: [1] I believe it can make sense in some
situations, but do you really want to implement such crypto-related
complexity?
HTML [1]: https://notnotp.com/notes/do-not-encrypt-ids/
formerly_proven wrote 23 hours 54 min ago:
The article is self-contradictory in that it acts like that key
is super-important ("Operations becomes a nightmare. You now have
a cryptographic secret to manage. Where does this key live?
Protected by a wrapping key living in a KMS or HSM? Do you use
the same key across prod, staging, and dev? If dev needs to test
with prod data, does it need access to prod encryption keys? What
about CI pipelines? Local developer machines?") but then also
acknowledges that we're talking about an obfuscation layer of
stuff which is not actually sensitive ("to hide timestamps that
aren't sensitive"). Don't get me wrong, it's a definitive
drawback for scaling the approach, but most applications have to
manage various secrets, most of which are actually important.
E.g. session signing keys, API keys, etc. It's still common for
applications to use signed sessions with RCE-prone data formats. The
language from that article, while not wrong, is much more apt for
those keys.
That being said, while fine for obfuscation, it should not be
used for security for this purpose, e.g. hidden/unlisted links,
confirmation links and so on. Those should use actual, long-ish
random keys for access, because the inability to enumerate them
is a security feature.
Retr0id wrote 1 day ago:
Why not use AES-128 by default? Your CPU has instructions to
accelerate AES-128.
benterix wrote 1 day ago:
I always thought they are used and stored as-is because the
kind of transformation you mention seems terribly expensive at
YT's scale, and I don't see a clear benefit in adding any kind
of obfuscation here.
pdimitar wrote 1 day ago:
> What do you think Youtube video IDs are?
I actually have no idea. What are they?
(Also what is the format of their `si=...` thing?)
intalentive wrote 23 hours 3 min ago:
Can't recall where I heard this, but I'm pretty sure the
si=... is tracking information that associates the link with the
user who shared it.
pdimitar wrote 22 hours 42 min ago:
Oh absolutely, I am just wondering _what_ does it contain.
mort96 wrote 1 day ago:
I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak
the creation times of everything.
Now this doesn't work if you actually have enough data that the
randomness of the UUIDv4 keys is a practical database performance
issue, but I think you really have to think long and hard about every
single use of identifiers in your application before concluding that
v7 is the solution. Maybe v7 works well for some things (e.g
identifiers for resources where creation times are visible to all
with access to the resource) but not others (such as users or orgs
which are publicly visible but without publicly visible creation
times).
barrkel wrote 1 day ago:
You shouldn't generally use PKs as public identifiers, least of all
UUIDs, which are pretty user hostile.
mort96 wrote 1 day ago:
I really don't see the issue with having a UUID in a URL.
throw0101a wrote 1 day ago:
> I tend to avoid UUIDv7 and use UUIDv4 because I don't want to
leak the creation times of everything.
See perhaps "UUIDv47 â UUIDv7-in / UUIDv4-out (SipHashâmasked
timestamp)":
* [1] * Sept 2025:
HTML [1]: https://github.com/stateless-me/uuidv47
HTML [2]: https://news.ycombinator.com/item?id=45275973
wongarsu wrote 1 day ago:
If that kind of stuff is on the table, you can also use boring
64-bit integer keys and encrypt those (e.g. [1]). Which in the end
is just a better-thought-out version of what the article author
did.
UUIDv47 might have a place if you need keys generated on multiple
backend servers without synchronization. But it feels very niche
to me.
1:
HTML [1]: https://wiki.postgresql.org/wiki/XTEA_(crypt_64_bits)
nbadg wrote 1 day ago:
I'm also not a huge fan of leaking server-side information; I
suspect UUIDv7 could still be used in statistical analysis of the
keyspace (in a similar fashion to the German tank problem for
integer IDs). Also, leaking data about user activity times (from
your other comment) is a *really* good point that I hadn't
considered.
I've read people suggest using a UUIDv7 as the primary key and a
UUIDv4 as a user-visible one as a remedy.
My first thought when reading the suggestion was, "well but you'll
still need an index on the v4 IDs, so what does this actually get
you?" But the answer is that it makes joins less expensive; you
only require the index once, when constructing the query from the
user-supplied data, and everything else operates with the
better-for-performance v7 IDs.
To be clear, in a practical sense, this is a bit of a
micro-optimization; as far as I understand it, this really only
helps you by improving the data locality of temporally-related
items. So, for example, if you had an "order items" table,
containing rows of a bunch of items in an order, it would speed up
retrieval times because you wouldn't need to do as many index
traversals to access all of the items in a particular order. But
on, say, a users table (where you're unlikely to be querying for
two different users who happen to have been created at
approximately the same time), it's not going to help you much. Of
course the exact same critique is applicable to integer IDs in
those situations.
Although, come to think of it, another advantage of a user-visible
v4 with a v7 PK is that you could use a different index type on the
v4 ID. Specifically, I would think that a hash index for the
user-visible v4 might be a halfway-decent way to go.
I'm still not sure either way if I like the idea, but it's
certainly not the craziest thing I've ever heard.
thayne wrote 12 hours 57 min ago:
I think a bigger benefit from doing that would be that inserts
would be cheaper. Instead of an expensive insert into the middle
of an index for every table that needs an index on that key, you
can do a cheaper insert at the end of the index for all of them
except for the one that uses uuid4.
But if you are doing that, why not just use an incrementing
integer instead of a uuidv7?
nbadg wrote 7 hours 34 min ago:
Certainly for many applications, the autoint approach would be
fine.
The benefit of uuid in this case is that it allows horizontally
scalable app servers to construct PKs on their own without risk
of collisions. In addition to just reducing database load by
doing the ID generation on the app server (admittedly usually a
minor benefit), this can be useful either to simplify insert
queries that span multiple tables with FK relationships
(potentially saving some round trips in the process) or in very
niche situations where you have circular dependencies in
non-nullable FKs (with the constraint deferred until the end of
the transaction).
cdmckay wrote 1 day ago:
Out of curiosity, why is it an issue if you leak creation time?
dboreham wrote 1 day ago:
Apart from all the other answers here: an external entity knowing
the relative creation time for two different accounts, or just
that the two accounts were created close in time to each other
can represent a meaningful information leak.
natch wrote 1 day ago:
If your system (pseudo-) random number generator (RNG) is
compromised to derive a portion of its entropy from things that
are knowable by knowing the time when the function ran, then the
search space for cracking keys created around the same time can
be shrunken considerably.
This doesnât even rely on your systemâs built-in RNG being
low quality. It could be audited and known to avoid such issues
but you could have a compromised compiler or OS that injects a
doctored RNG.
saaspirant wrote 1 day ago:
There was a HN comment about competitors tracking how many new
signups are happening and increasing the discounts/sales push
based on that. Something like this.
0x3f wrote 1 day ago:
That's happening everywhere. You can order industrial parts
from a Fortune 500 and check some of the numbers on it too, if
they're not careful about it.
JetSetIlly wrote 1 day ago:
In a business I once worked for, one of the users of the online
ordering system represented over 50% of the business' income,
something you wouldn't necessarily want them to know.
However, because the online ordering system assigned order
numbers sequentially, it would have been trivial for that
company to determine how important their business was.
For example, over the course of a month, they could order
something at the start of the month and something at the end of
the month. That would give them the total number of orders in
that period. They already know how many orders they have placed
during the month, so company_orders / total_orders =
percentage_of_business
It doesn't even have to be accurate, just an approximation. I
don't know if they figured out that they could do that but it
wouldn't surprise me if they had.
pezezin wrote 17 hours 27 min ago:
This is also something that depends heavily on regulations.
In my home country, invoice numbers have to be sequential by
law, although you can restart the numbering every year.
JetSetIlly wrote 7 hours 36 min ago:
Yes, even if it's not a legal requirement it's definitely
best practice to have sequential invoice numbers. I thought
about this at the time but these numbers aren't invoice
numbers, only order numbers.
hexbin010 wrote 13 hours 43 min ago:
A global sequence, or sequence-per-account? I suspect the
latter?
pezezin wrote 4 hours 28 min ago:
A sequence per "series", where a series can be a fiscal
year, a department or category, etc. But I am not sure if
you can have one series per customer, I only find
conflicting information.
You can have more details here, in the section "Complete
invoice": [1] [2] (Spanish only)
HTML [1]: https://sede.agenciatributaria.gob.es/Sede/en_gb...
HTML [2]: https://www.boe.es/buscar/act.php?id=BOE-A-2012-...
Bombthecat wrote 1 day ago:
Admins, early users, founders, CEOs, etc. would have the
lowest creation times...
bruce511 wrote 1 day ago:
The issue will be very context specific. In other words to
(reasonably) answer the question, we'd have to judge each
application individually.
For one example, say you were making voting-booth software. You
really don't want a (hidden) timestamp attached to each vote
(much less an incrementing id) because that would break voter
confidentiality.
More generally, it's more a underlying principle of data
management. Not leaking ancillary data is easier to justify than
"sure we leak the date and time of the record creation, but we
can't think of a reason why that matters."
Personally I think the biggest issue is "clever" programmers who
treat the uuid as data and start displaying the date and time.
This leads to complications ("that which is displayed, the
customer wants to change"). It's only a matter of time before
someone declares the date "wrong" and it must be "fixed". Not to
mention time zone or daylight savings conversions.
mort96 wrote 1 day ago:
Well you're leaking user data. I'm sure you can imagine
situations where "the defendant created an account on this site
on this date" could come up. And the user could have created that
account not knowing that the creation date is public, because
it's not listed anywhere in the publicly viewable part of the
profile other than the UUID in the URL.
koakuma-chan wrote 1 day ago:
Discord is doing fine.
mort96 wrote 1 day ago:
Hacker news is also doing fine, even though I can just click
your profile and see you joined in october 2024. It doesn't
matter for every use case.
But there are cases where it matters. Using UUIDv7 for
identifiers means you need to carefully consider the security
and privacy implications every time you create a new table
identified by a UUID, and you'll possibly end up with some
tables where you use v4 and some where you use v7. Worst
case, you'll end up with painful migrations from v7 to v4 as
security review identifies timestamped identifiers as a
security concern.
nish__ wrote 1 day ago:
Pretty much every social media app has a "Member since X"
visible on public profiles. I don't think it's an issue.
mort96 wrote 1 day ago:
Who said I was talking about social media?
nish__ wrote 1 day ago:
Well where else do users have public profiles?
0x3f wrote 1 day ago:
The whole point though is that the ID itself leaks info,
even if the profile is not public. There are many cases
where you reference an object as a foreign key, even if
you can't see the entire record of that foreign key.
nish__ wrote 22 hours 10 min ago:
I can't think of any.
strbean wrote 18 hours 59 min ago:
Sending a friend request is an obvious example.
kreetx wrote 1 day ago:
E.g, if your service users have timestamp as part of the key and
this data is visible to other users, you would know when that
account was created. This could be an issue.
robertlagrant wrote 1 day ago:
Depends on the data. If you use a primary key in data about a
person that shouldn't include their age (e.g. to remove age-based
discrimination) then you are leaking an imperfect proxy to their
age.
lwhi wrote 1 day ago:
So the UUID could be used as an imperfect indicator of a
record's creation time?
benterix wrote 1 day ago:
UUIDv7 but not UUIDv4.
lwhi wrote 1 day ago:
I suppose timing attacks become an issue too.
wongarsu wrote 1 day ago:
UUIDv7 still have a lot of random bits. Most attacks
around creating lots of ids are foiled by that
jwr wrote 1 day ago:
"if you use PostgreSQL"
(in the scientific reporting world this would be the perennial "in
mice")
kijin wrote 1 day ago:
The space requirement and index fragmentation issue is nearly the
same no matter what kind of relational database you use. Math is
math.
Just the other day I delivered significant performance gains to a
client by converting ~150 million UUIDv4 PKs to good old BIGINT. They
were using a fairly recent version of MariaDB.
splix wrote 1 day ago:
I think the author means all dbs that fit on a single server. Because
in distributed dbs you often want to spread the load evenly over
multiple servers.
esafak wrote 1 day ago:
To spell it out: it improves performance by avoiding hot spots.
zelphirkalt wrote 1 day ago:
If they can live with making keys only in one place, then sure,
this can work. If, however, they need something that is very highly
likely to be unique across machines, without the need to sync, then
using a big integer is no good.
If they can live with MariaDB, OK, but I wouldn't choose it in
the first place these days. Likely Postgres will also perform
better in most scenarios.
kijin wrote 1 day ago:
Yeah, they had relatively simple requirements so BIGINT was a
quick optimization. MariaDB can guarantee uniqueness of
auto-incrementing integers across a cluster of several servers,
but that's about the limit.
Had the requirements been different, UUIDv7 would have worked
well, too, because fragmentation is the biggest problem here.
orthoxerox wrote 1 day ago:
It's not just Postgres or even OLTP. For example, if you have an
Iceberg table with SCD2 records, you need to regularly locate and
update existing records. The more recent a record is, the more likely
it is to be updated.
If you use UUIDv7, you can partition your table by the key prefix.
Then the bulk of your data can be efficiently skipped when applying
updates.
andatki wrote 15 hours 4 min ago:
Good addition!
hyperpape wrote 1 day ago:
The thing is, none of us are mice, but many of us use Postgres.
It would be the equivalent of "if you're a middle-aged man" or
"you're an American".
P.S. I think some of the considerations may be true for any system
that uses B-Tree indexes, but several will be Postgres specific.