COMMENT PAGE FOR:
HTML Avoid UUID Version 4 Primary Keys in Postgres
tom_m wrote 10 hours 27 min ago:
It depends. What a narrow-minded article, sigh. Bring on the vibe
coding! Let's just be mindless.
marcus_holmes wrote 11 hours 11 min ago:
I didn't see my primary use case for UUIDs covered: sharing
identifiers across entities is dangerous.
I wrote a CRUD app for document storage. It had user IDs and document
IDs. I wrote a method GetDocumentForUser(docID, userID) that checked
permissions for that user and document and returned the document if
permitted. I then, stupidly, called that method with
GetDocumentForUser(userID, docID), and it took me a good half hour to
work out why this never returned anything.
It never returned anything because a valid userID will never be a valid
docID. If I had used integers it would have returned documents, and I
probably wouldn't have spotted it while testing, and I would have
shipped a change that cheerfully handed people other people's
documents.
I will put up with a fairly considerable amount of performance hit to
avoid having this footgun lurking. And yes, I know there are other ways
around this (e.g. types) but those come with their own trade-offs too.
Jeff_Brown wrote 12 hours 51 min ago:
If you're really ambitious you'll use two UUIDs for the ID, because for
an app in which at least a billion people have at least 327 million
random v4 UUIDs, the probability of a collision will be greater than
1%.
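For reference, the back-of-envelope birthday bound behind that figure
(n is the hypothetical id count above, N the number of possible random
v4 values):
\[ p \approx 1 - e^{-n^2/(2N)}, \quad n = 10^9 \cdot 3.27\times10^8 \approx 3.3\times10^{17}, \quad N = 2^{122} \approx 5.3\times10^{36} \]
\[ n^2/(2N) \approx 1.07\times10^{35} / 1.06\times10^{37} \approx 0.01 \;\Rightarrow\; p \approx 1\% \]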
deepsun wrote 13 hours 24 min ago:
I still don't understand why people don't remove the hyphens from
UUIDs. Hyphens make it harder to copy-paste IDs. The only reason to
keep them is to make it explicit "hey, this is a UUID"; otherwise it's
a completely internal affair.
Even worse, some tools generate random strings and then ADD hyphens to
them to look like a UUID (even though it's not one, as the UUID version
byte is filled randomly as well). I cannot wrap my head around why, e.g.:
HTML [1]: https://github.com/VictoriaMetrics/VictoriaLogs/blob/v1.41.0/a...
dspillett wrote 15 hours 42 min ago:
> Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities
It is not just about being hard to guess a valid individual identifier
in a vacuum. Random (or at least random-ish) values, be they UUIDs or
undecorated integers, in this context are also about it being hard to
guess one from another, or a selection of others.
Wrt: "it isn't x it is y" form: I'm not an LLM, 'onest guv!
moomoo11 wrote 15 hours 58 min ago:
I'll do it regardless because any time I've tried to chase
optimization early like this, hardware has always evolved faster.
We have all faced issues where we don't know where the data will
ultimately live that's optimal for our access patterns.
Or we have devices and services doing async operations that need to
sync.
I'm not working on mission critical "if this fails there's a
catastrophic event" type shit. It's just rent-seeking SaaS type
shit.
Oh no, it cost $0.35 extra to make $100. Next year we'll make more money
relative to the cost increase.
dizlexic wrote 18 hours 4 min ago:
Meh, uuid go brrrrrrrrrrrrrrrrrrrrrrrrr
Xorakios wrote 18 hours 16 min ago:
I was born in the Vancouver Social Security office #538 area in 1962.
Guess how much I have to defend against attackers trying 538-62-xxxx
mcsoft wrote 18 hours 17 min ago:
Snowflake [1] or Sonyflake [2] ids work:
HTML [1]: https://en.wikipedia.org/wiki/Snowflake_ID
HTML [2]: https://github.com/sony/sonyflake?tab=readme-ov-file
burnt-resistor wrote 18 hours 31 min ago:
UUID PKs are trying to solve the wrong problem. Integer/serial primary
keys are not the problem so long as they're never exposed or usable
externally. A critical failure of nearly every RESTful framework is
exposing internal database identifiers rather than encrypted ones,
which preserve relative performance and creation order while
eliminating probing for unowned keys.
p0w3n3d wrote 19 hours 48 min ago:
What about the newest PostgreSQL support for UUIDv7? Has anybody done
tests? This is what we're heading towards at the moment of writing, so
I'd like to ask before we'd eventually have to roll back the decision.
AtNightWeCode wrote 19 hours 57 min ago:
The counterargument I would make is that having all these integer ids
comes with many problems. You can't make em public cause they leak
info. They are not unique across environments. Meaning you have to spin
up a lot of bs envs to just run it. But retros are for complaining
about test envs, right?
That UUIDv4 only has 122 random bits is a bs argument. Such a made-up problem.
But a fair point is that one should use a sequential uuid to avoid
fragmentation. One that has a time part.
kgeist wrote 19 hours 31 min ago:
Some additional cases we encounter quite often where UUIDs help:
- A client used to run our app on-premises and now wants to migrate
to the cloud.
- Support engineers want to clone a client's account into the dev
environment to debug issues without corrupting client data.
- A client wants to migrate their account to a different region (from
US to EU).
Merging data using UUIDs is very easy because ID collisions are
practically impossible. With integer IDs, we'd need complex and
error-prone ID-rewriting scripts. UUIDs are extremely useful even
when the tables are small, contrary to what the article suggests.
andatki wrote 14 hours 53 min ago:
If merging or moving data between environments is a regular
occurrence, I agree it would be best to have non-colliding primary
keys. I have done an environment move (new DB in a different AWS
region) with integers and sequences for maybe a 100-table DB and
it's doable but a high-cost task. At that company we also had
the demo/customer preview environment concept where we needed to
keep the data but move it.
m000 wrote 20 hours 14 min ago:
Why not just use UUIDs as a unique column next to a bigint PK?
The power and main purpose of UUIDs is to act as easy to produce,
non-conflicting references in distributed settings. Since the scope of
TFA is explicitly set to be "monolithic web apps", nothing stops you
from having everything work with bigint PKs internally, and just add
the UUIDs where you need to provide external references to
rows/objects.
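A minimal sketch of that layout (the table and column names here are
illustrative, not from TFA; gen_random_uuid() could be any generator):
    CREATE TABLE documents (
        id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        external_id uuid NOT NULL DEFAULT gen_random_uuid() UNIQUE,
        title       text NOT NULL
    );
    -- FKs and joins stay on the compact bigint; only external_id
    -- ever leaves the system
    SELECT id FROM documents WHERE external_id = $1;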
mrkeen wrote 16 hours 47 min ago:
Yes, if you're in the group of developers who are passionate about db
performance, but have ruled out the idea of spreading work out to
multiple DBs, then continuing to use sequential IDs is fine.
kortex wrote 20 hours 28 min ago:
I've been using ULIDs [0] in prod for many years now, and I love them.
I just use string encoding, though if I really wanted to squeeze out
every last MB, I could do some conversion so it is stored as 16 bytes
instead of 26 chars. In practice it's never mattered, and the
simplicity of just string IDs everywhere is nice.
Sometimes I have to talk to legacy systems, all my APIs have str IDs,
and I encode int IDs as just decimal left padded with leading zeros up
to 26 chars. Technically not a compliant ULID but practically speaking,
if I see leading `00` I know it's not an actual ULID, since that would
be before Nov-2004, and ULID was invented in 2017. The ORM
automatically strips the zeros and the query just works.
I'm just kind of over using sequential int IDs for anything bigger than
hobby level stuff. Testing/fixturing/QA are just so much easier when
you do not have to care about whether an ID happens to already exist.
[0]
HTML [1]: https://github.com/ulid/spec
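(For what it's worth, that zero-padding shim is a one-liner in SQL as
well; 26 is the length of a ULID's text form, and 42 stands in for a
legacy int ID:)
    SELECT lpad(42::text, 26, '0');  -- 24 leading zeros, then '42'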
sedatk wrote 20 hours 24 min ago:
See
HTML [1]: https://news.ycombinator.com/item?id=46211578
kortex wrote 15 hours 49 min ago:
The Python implementation I use doesn't implement this quirk. It's just
timestamp + randomness in Crockford Base32. That's all I need. Sure,
it doesn't fully "comply with the spec", but frankly the
sub-millisecond sequence quirk was a complete mistake.
impoppy wrote 20 hours 30 min ago:
This is such a mediocre article. It provides plenty of valid reasons to
consider avoiding UUIDs in databases; however, it doesn't say what
should be used if one wants primary keys that are not easy to
predict. The XOR alternative is too primitive. And, well, while I get
why I should consider avoiding UUIDs, what should I use instead?
p1necone wrote 20 hours 40 min ago:
Being able to create something and know the id of it before waiting for
an http round trip simplifies enough code that I think UUIDs are worth
it for me. I hadn't really considered the potential perf optimization
from orderable ids before though - I will consider UUID v7 in future.
andatki wrote 15 hours 1 min ago:
Great!
miiiiiike wrote 20 hours 45 min ago:
The article is muddled; I wish he'd split it into two: one for UUID4
and another for UUID7.
I was using 64-bit snowflake pks
(timestamp+sequence+random+datacenter+node) previously and made the
switch to UUID7 for sortable, user-facing pks. I'm more than fine
letting the DB handle a 128-bit int instead of a 64-bit int if it means
not having to make sure that the latest version of my snowflake function
has made it to every db or that my snowflake server never hiccups,
ever.
Most of the data that's going to be keyed with a uuid7 is getting
served straight out of Redis anyway.
maxnoe wrote 20 hours 59 min ago:
Would this argument still apply if I need to store a uuidv4 anyway in
the table?
And I'd likely want a unique constraint on that?
phendrenad2 wrote 21 hours 2 min ago:
You probably don't want integer primary keys, and you probably don't
want UUID primary keys. You probably want something in-between,
depending on your use case. UUID is one extreme on this spectrum, which
tries to solve all of the problems, including ones you might not have.
pil0u wrote 20 hours 45 min ago:
What's in-between? I posted the article because I'm in the middle of
that choice and wanted to generate discussion/contradiction.
So far, people have talked a lot about UUIDs, so I'm genuinely
curious about what's in-between.
phendrenad2 wrote 12 hours 45 min ago:
An example would be YouTube's video IDs. They're custom-fit for a
purpose (security: no; avoiding the problem where people fish for
auspicious YouTube video numbers or something: yes).
Another example would be a function that sorts the numbers 0
through 999 in a seemingly random order (but is actually
deterministic), and then repeats that for each block of 1000 with a
slight shift; a sketch follows below. That discourages casual numeric
iteration but isn't as complex or cryptographically secure as a UUID.
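A toy version of that shuffle in SQL (387 is an arbitrary multiplier,
coprime to 1000, so the mapping is a true permutation):
    -- maps each id in 0..999 to a fixed, scrambled-looking position
    SELECT id, (id * 387) % 1000 AS public_n
    FROM generate_series(0, 999) AS id;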
can3p wrote 21 hours 18 min ago:
> For many business apps, they will never reach 2 billion unique values
per table, so this will be adequate for their entire life. I've also
recommended always using bigint/int8 in other contexts.
I'm sure every dba has a war story that starts with a similar decision
in the past.
Waterluvian wrote 21 hours 44 min ago:
Any decent resources with benchmark data on Postgres insertion,
indexing, retrieval, etc. for UUID vs. integer-based PKs?
whateveracct wrote 21 hours 52 min ago:
Sometimes it's nice for your PK to be uniformly distributed. As a
reader, even if it hurts as a writer. For instance, you can easily
shard queries and workloads.
> the impact to inserts and retrieval of individual items or ranges of
values from the index.
Classic OLTP vs OLAP.
iainctduncan wrote 22 hours 16 min ago:
Counterargument... I do technical diligence so I talk to a lot of
companies at points of inflection, and I also talk to lots who are
stuck.
The ability to rapidly shard everything can be extremely valuable. The
difference between "we can shard on a dime" and "sharding will take a
bunch of careful work" can be expensive. If the company has poor
margins, this can be the difference between "can scale easily" and
"we're not getting this investment".
I would argue that if your folks have the technical chops to be able to
shard while avoiding surrogate guaranteed unique keys, great. But if
they don't.... a UUID on every table can be a massive get-out-of-jail
free card, and for many companies this is much, much more important than some
minor space and time optimizations on the DB.
Worth thinking about.
mrinterweb wrote 16 hours 51 min ago:
I can see how sharding could be difficult with a bigint FK, but
UUIDv7 would still play nice, if I understand your point correctly.
The point of the article is that monotonically increasing foreign keys
have performance benefits over random UUIDv4 FKs in PostgreSQL.
jessep wrote 20 hours 42 min ago:
Sort of related, but we had to shard as usage grew and didn't have
uuids and it was annoying. Wasn't the most annoying bit though.
Whole thing is pretty complex regardless of uuid, if you have a
highly interconnected data model that needs to stay online while
migrating.
iainctduncan wrote 20 hours 40 min ago:
Right, but if you start off with uuids and the expectation that you
might use them to shard, you'll wind up factoring that into the
data model. Retrofitting, as you rightly say, can be much harder.
FancyFane wrote 22 hours 24 min ago:
MySQL benefits from these changes as well. What we're really
discussing is random primary key inserts (UUIDv4) vs incrementing
primary key inserts (UUIDv6 or v7).
PlanetScale wrote up a really good article on why incrementing primary
keys are better for b-tree performance than randomly inserted
primary keys.
HTML [1]: https://planetscale.com/blog/btrees-and-database-indexes
sneak wrote 23 hours 27 min ago:
This is why ULID exists and why I use them in my ext_id columns. For
the actual relational IDs internal to the db I use smaller/faster data
types.
henning wrote 23 hours 31 min ago:
UUIDs make enumeration attacks harder and also prevent situations where
seeing a high valid ID value lets you estimate how much money a private
company is earning if they charge based on the object the ID is
associated with. If you can sample enough object ID values and see when
the IDs were created, you could reverse engineer their ARR chart and
see whether they're growing or not which many companies want to avoid.
hippo22 wrote 23 hours 36 min ago:
The author should include benchmarks; otherwise, saying that UUIDs
"increase latency" is meaningless. For instance, how much longer
does it take to insert a UUID vs. an integer? How much longer does
scanning an index take?
jakeydus wrote 23 hours 19 min ago:
The author doesn't reference any tests that they themselves ran, but
they did link a cybertec article [0] with some benchmarks.
[0]
HTML [1]: https://www.cybertec-postgresql.com/en/unexpected-downsides-...
deathanatos wrote 23 hours 51 min ago:
> Are UUIDs secure?
> Misconceptions: UUIDs are secure
> One misconception about UUIDs is that they're secure. However, the
RFC describes that they shouldn't be considered secure
"capabilities."
> From RFC 4122 Section 6 Security Considerations:
> Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities
This is just wrong, and the citation doesn't support it. You're not
guessing a 122-bit long random identifier. What's crazy is that the
article, immediately prior to this, even cites the very math involved
in showing exactly how unguessable that is.
⦠the linked citation (to §4.4, which is different from the in-prose
citation) is just about how to generate a v4, and completely unrelated
to the claim. The prose citation to §6 is about UUIDs generally: the
statement "Do not assume that [all] UUIDs are hard to guess" is not
logically inconsistent with properly-generated UUIDv4s being hard to
guess. A subset of UUIDs have security properties, if the system
generating & using them implements those properties, but we should not
assume all UUIDs have that property.
Moreover, replacing an unguessable UUID with an (effectively random)
32-bit integer does make it guessable, and the scheme laid out seems
completely insecure if it is to be used in the contexts where one finds
UUIDv4s serving as unguessable identifiers.
The additional size argument is pretty weak too; at "millions of rows",
a UUID column is consuming an additional ~24 MiB.
stickfigure wrote 1 day ago:
This is incredibly database-specific. In Postgres random PKs are bad.
But in distributed databases like Cockroach, Google Cloud Datastore,
and Spanner it is the opposite - monotonic PKs are bad. You want to
distribute load across the keyspace so you avoid hot shards.
nonethewiser wrote 13 hours 29 min ago:
>In Postgres random PKs are bad
Why?
whizzter wrote 15 hours 27 min ago:
Even in a distributed database you want increasing (even if not
monotonic) keys since the underlying b-tree or whatever will very
likely behave badly for entirely random data.
UUIDv7 is very useful for these scenarios since
A: A hash or modulus of the key will be practically random due to the
lower bits being random or pseudo-random (i.e. it distributes well
between nodes)
B: the first bits are sortable, so the underlying storage on each
node won't go bananas.
dspillett wrote 15 hours 38 min ago:
I wouldn't say it is incredibly database specific; it is more
database-type specific. For most general, non-sharded, databases,
random key values can be a problem as they lead to excess
fragmentation in b-trees and similar structures.
PunchyHamster wrote 16 hours 32 min ago:
UUIDv7 is not monotonic tho
HendrikHensen wrote 19 hours 46 min ago:
In Google Cloud Bigtable we had the issue that our domain's primary
key was a sequential integer autogenerated by another app. So we just
reversed it, and it distributed automatically quite nicely.
bruckie wrote 19 hours 41 min ago:
Doing this (reversing the bits of sequential ids) is common
guidance inside Google for its distributed datastores like Spanner
and Bigtable.
hedora wrote 19 hours 50 min ago:
It's also application specific. If you have a workload that's write
heavy, has temporal skew, and is highly concurrent, but rarely creates
new records, you're probably better off with a random PK, even in PG.
kortex wrote 20 hours 45 min ago:
As long as the key has sufficient entropy (i.e. not monotonic
sequential ints), that ensures the keyspace is evenly distributed,
correct? So UUID>=v4, ULID, KSUID, possibly snowflake, should be fine
for the sake of even distribution of the hashes.
nutjob2 wrote 22 hours 16 min ago:
> You want to distribute load across the keyspace so you avoid hot
shards.
This is just another case of keys containing information and is not
smart.
The obvious solution is to have a field that drives distribution,
allowing rebalancing or whatever.
stickfigure wrote 21 hours 44 min ago:
This is something you should discuss with the developers at
Cockroach Labs, Google Cloud, et al.
As a consumer of these databases we're stuck with them as designed,
which means we have to worry about key distribution.
conradludgate wrote 22 hours 6 min ago:
100%. You can use rendezvous hashing to determine the shard(s). The
hash of a sequence should be randomly distributed, as changing the
LSB should propagate to a 50% change in the output bits.
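A minimal rendezvous-hashing sketch in SQL, using Postgres'
(undocumented but long-stable) hashtext() and made-up shard names;
'42' stands in for the key being placed:
    -- each key independently picks the shard maximizing hash(shard, key)
    SELECT s AS chosen_shard
    FROM unnest(ARRAY['shard-a', 'shard-b', 'shard-c']) AS s
    ORDER BY hashtext(s || ':' || '42') DESC
    LIMIT 1;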
dap wrote 22 hours 47 min ago:
It is, although you can have sharded PostgreSQL, in which case I
agree with your assessment that you want random PKs to distribute
them.
It's workload-specific, too. If you want to list ranges of them by
PK, then of course random isn't going to work. But then you've got
competing tensions: listing a range wants the things you list to be
on the same shard, but focusing a workload on one shard undermines
horizontal scale. So you've got to decide what you care about (or do
something more elaborate).
jakeydus wrote 23 hours 21 min ago:
I think they address this in the article when they say that this
advice is specific to monolithic applications, but I may be
misremembering (I skimmed).
andy_ppp wrote 23 hours 16 min ago:
Are you saying a monolith cannot use a distributed database?
jakeydus wrote 23 hours 12 min ago:
I'm not making any claims at all, I was just adding context from
my recollection of the article that appeared to be missing from
the conversation.
Edit: What the article said:
> The kinds of web applications Iâm thinking of with this post
are monolithic web apps, with Postgres as their primary OLTP
database.
So you are correct that this does not disqualify distributed
databases.
cruffle_duffle wrote 1 day ago:
I never understood the arguments against using globally unique
ids. For example, how it somehow messes up indexes. I'm not a CS major
but those are typically b-trees are they not? If you have a primary key
whose generation is truly random such that each number is equally
likely, then that b-tree is always going to be balanced.
Yes there are different flavors of generating them with their own pros
and cons, but at the end of the day it's just so much more elegant
than some auto incrementing crap your database creates. But that is
just semantics; you can always change the uuid algorithm for future
keys. And honestly, if you treat the uuid as some opaque entity (which
you should), why not just pick the random one?
And I just thought of the argument "but what if you want to sort
the uuid..." - say it's used for a list of stories or something?
Well, again... if you treat the uuid as opaque why would you sort it?
You should be sorting on some other field like the date field or title
or something. UUIDs are opaque, damn it. You don't sort opaque data.
"Well they get clustered weird," say people. Why are you clustering
on a random opaque key? If you need certain data to be clustered, then
do it on the right key (a user_id field if your data wants to be
clustered by user, say).
Letting the client generate the primary keys is really liberating. Not
having to care about PK collisions or leaking information via auto
incrementing numbers is great!
In my opinion uuid isn't used enough!
chuckadams wrote 20 hours 39 min ago:
> If you have a primary key whose generation is truly random such
that each number is equally likely, then that b-tree is always going
to be balanced.
Balanced and uniformly scattered. A random index means fetching a
random page for every item. Fine if your access patterns are truly
random, but that's rarely the case.
> Why are you clustering on a random opaque key?
InnoDB clusters by the PK if there is one, and that can't be changed
(if you don't have a PK, you have some options, but let's assume you
have one). MSSQL behaves similarly, but you can override it. If
your PK is random, your clustering will be too. In Postgres, you'll
just get fragmented indexes, which isn't quite as bad, but still
slows down vacuum. Whether that actually becomes a problem is also
going to depend on access patterns.
One shouldn't immediately freak out over having a random PK, but
should definitely at least be aware of the potential degradation they
might cause.
cruffle_duffle wrote 14 hours 23 min ago:
I honestly feel like, while you are indeed correct, for most cases
it's absolutely fine to use some flavor of uuid. I feel like the
benefits outweigh the costs in most cases.
chuckadams wrote 1 hour 10 min ago:
Sure, and for many cases, uuidv7 is that flavor. It just comes
with a timestamp, which may or may not be an issue. It isn't an
issue for me, which is why I use it myself.
nesarkvechnep wrote 1 day ago:
If we embraced REST, as Roy Fielding envisioned it, we wouldn't have
this, and all similar, conversations. REST doesn't expose identifiers,
it only exposes relationships. Identifiers are an implementation
detail.
aynyc wrote 1 day ago:
I've seen this type of advice a few times now. Now I'm not a database
expert by any stretch of the imagination, but I have yet to see UUIDs
as primary keys in any of the systems I've touched.
Are there valid reasons to use UUIDs (assuming they're used correctly)
for primary keys? I know systems have incorrectly exposed primary keys
to the public, but assume that's not the concern. Why use UUID over
big-int?
rcxdude wrote 1 day ago:
Uuids also allow the generation of the ID to be separate from the
insertion into the database, which can be useful in distributed
systems.
cruffle_duffle wrote 1 day ago:
I mean this is the primary reason right here! You can pre-create
an entire tree of relationships client side and ship it off to the
database with everything all nice and linked up. And since by
design each PK is globally unique youâll never need to worry
about constraint violations. Itâs pretty damn nice.
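A sketch of what that looks like (hypothetical tables, ids minted
client-side as v4):
    -- the client already knows every id, so the whole tree ships in
    -- one transaction with the FKs pre-wired
    BEGIN;
    INSERT INTO folders (id, name)
    VALUES ('0b9f43a2-6c1e-4f8a-9d27-55aa01b3c9e4', 'inbox');
    INSERT INTO documents (id, folder_id, title)
    VALUES ('d1c2b3a4-5e6f-47a8-b9c0-d1e2f3a4b5c6',
            '0b9f43a2-6c1e-4f8a-9d27-55aa01b3c9e4', 'draft');
    COMMIT;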
assbuttbuttass wrote 1 day ago:
At least for the Spanner DB, it's good to have a randomly-distributed
primary key since it allows better sharding of the data and avoids
"hot shards" when doing a lot of inserts. UUIDv4 is the typical
solution, although a bit-reversed incrementing integer would work too
HTML [1]: https://cloud.google.com/blog/products/databases/announcing-...
ellisv wrote 1 day ago:
About 10 years ago I remember seeing a number of posts saying "don't
use int for ids!". Typically the reasons were things like "the id
exposes the number of things in the database" and "if you have bad
security then users can increment/decrement the id to get more
data!". What I then observed was a bunch of developers rushing to use
UUIDs for everything.
UUIDv7 looks really promising but I'm not likely to redo all of our
tables to use it.
andatki wrote 15 hours 30 min ago:
Note that if you're using UUID v4 now, switching to v7 does not
require a schema migration. You'd get the benefits when working
with new records, for example reduced insert latency. The uuid data
type supports both.
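Concretely, assuming PostgreSQL 18+ (which ships a native uuidv7()
function) and a hypothetical documents table, the switch is just a new
column default; old v4 values stay valid since the type is uuid either
way:
    ALTER TABLE documents
        ALTER COLUMN id SET DEFAULT uuidv7();  -- was gen_random_uuid()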
dangets wrote 23 hours 52 min ago:
You can use the same techniques except with the smaller int64 space
- see e.g. Snowflake ID -
HTML [1]: https://en.wikipedia.org/wiki/Snowflake_ID
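A sketch of the classic 64-bit layout (41 bits of milliseconds since a
custom epoch, 10 bits of node id, 12 bits of sequence; the node id 5
and sequence 42 are placeholders, and the constant is Twitter's
original 2010 epoch):
    SELECT (((extract(epoch FROM clock_timestamp()) * 1000)::bigint
             - 1288834974657) << 22)  -- ms since custom epoch
           | (5 << 12)                -- node id
           | 42                       -- per-millisecond sequence
           AS snowflake_id;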
reffaelwallen wrote 1 day ago:
At my company we only use UUIDs as PKs.
Main reason I use it is the German tank problem: [1] (tl;dr: it prevents
someone from counting how many records you have in that table)
HTML [1]: https://en.wikipedia.org/wiki/German_tank_problem
infragreen wrote 22 hours 49 min ago:
This was a great read, thank you for sharing!
andatki wrote 15 hours 30 min ago:
Appreciate it!
jakeydus wrote 23 hours 13 min ago:
I'm new to the security side of things; I can understand that
leaking any information about the backend is no bueno, but why
specifically is table size an issue?
boruto wrote 22 hours 43 min ago:
At my old company, new joiners in tech were assigned a monotonically
increasing number as their ID. GitHub profile URLs reflected that.
Someone may or may not have used the pattern to get to know the
attrition rate by running a simple script every month))
littlestymaar wrote 1 day ago:
What stops you from having another uuid field as the publicly visible
identifier (which is only a concern for a minority of your tables)?
This way you avoid most of the issues highlighted in this article,
without compromising your confidential data.
mgoetzke wrote 1 day ago:
That depends a lot on many factors, and thus I don't like generic
statements like that, which tend to be focused on a specific
database pattern. That said, everyone should indeed be aware of the
potential tradeoffs.
And of course we could come up with many ways to generate our own ids
and make them unique, but we have the following requirements.
- It needs to be a string (because we allow composing them to 'derive'
keys)
- A client must be able to create them (not just a server) without risk
for collisions
- The time order of keys must not be guessable easily (as the id is
often leaked via references which could 'betray' not just the existence
of a document, but also its relative creation time wrt others).
- It should be easy to document how any client can safely generate
document ids.
The lookup performance is not really such a big deal for us. Where it
is we can do a projection into a more simple format where applicable.
gwbas1c wrote 1 day ago:
I work on an application where we encrypt the integer primary key and
then use the bytes to generate something that looks like a UUID.
In our case, we don't want database IDs in an API and in URLs. When IDs
are sequential, it enables things like dictionary attacks and provides
estimates about how many customers we have.
Encrypting a database ID makes it very obvious when someone is trying
to scan, because the UUID won't decrypt. We don't even need a database
round trip.
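A rough Postgres-side sketch of the same idea using pgcrypto (not the
actual .NET implementation; the 16-character key is a placeholder):
    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    -- an 8-byte bigint pads out to one 16-byte AES block, i.e.
    -- exactly UUID-sized output
    SELECT encode(encrypt(int8send(12345::bigint),
                          'sixteen-byte-key', 'aes'), 'hex');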
sedatk wrote 20 hours 57 min ago:
That sounds quite troublesome if the encryption key is lost,
compromised, or rotated for any other reason.
cowboylowrez wrote 14 hours 16 min ago:
Key management seems to be as important as backups, but I understand
that something so small (an encryption key) could seem unimportant
because database backups are so big lol. But they really do share
important attributes (do not lose your keys, do not lose your data,
do not expose your keys, do not expose your data, etc etc).
gwbas1c wrote 15 hours 57 min ago:
The key will never be rotated. (There's no reason to.)
We're not worried about key compromises.
If the key is lost, we have much bigger problems.
puilp0502 wrote 1 day ago:
Few questions:
* How do you manage the key for encrypting IDs? Injected to app
environment via envvar? Just embedded in source code? I ask this
because I'm curious as to how much "care" I should be putting in into
managing the secret material if I were to adopt this scheme.
* Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the
plain AES suffice? I assume that the size of IDs would never exceed
the block size of AES, but again, I'm not a cryptographer so not sure
if it's safe to do so.
gwbas1c wrote 15 hours 50 min ago:
> How do you manage the key for encrypting IDs?
The same way we manage all other secrets in the application.
(Summarized below)
> Is the ID encrypted using AEAD scheme (e.g. AES-GCM)? Or does the
plain AES suffice? I assume that the size of IDs would never exceed
the block size of AES, but again, I'm not a cryptographer so not
sure if it's safe to do so.
I don't have the source handy at the moment. It's one of the easier
to use symmetric algorithms available in .Net. We aren't talking
military-grade security here. In general: a 32-bit int encrypts to
64-bits, so we pad it with a few unicode characters so it's 64-bits
encrypted to 128 bits.
---
As far as managing secrets in the application: We have a homegrown
configuration file generator that's adapted to our needs. It
generates both the configuration files, and strongly-typed classes
to read the files. All configuration values are loaded at startup,
so we don't have to worry about runtime errors from missing
configuration values.
Secrets (connection strings, encryption keys, etc.) are encrypted
in the configuration file as base64 strings. The certificate to
read/write secrets is stored in Azure Keyvault.
The startup logic in all applications is something like:
1: Determine the environment (production, qa, dev)
2: Get the appropriate certificate
3: Read the configuration files, including decrypting secrets (such
as the primary key encryption keys) from the configuration files
4: Populate the strongly-typed objects that hold the configuration
values
5: These objects are dependency-injected to runtime objects
stevefan1999 wrote 1 day ago:
Try Snowflake ID then:
HTML [1]: https://en.wikipedia.org/wiki/Snowflake_ID
gizzlon wrote 23 hours 37 min ago:
Leaks time and order, just like uuid v7.
0x457 wrote 21 hours 7 min ago:
Kinda the point? Ordered PKs give you a better index.
p2detar wrote 1 day ago:
Another interesting article from Feb-2024 [0] where the cost of
inserting a uuid7() and a bigint is basically the same. To me it wasn't
quite clear what the problem with the buffer cache is, but this author
makes it much clearer than the OP's article:
> We need to read blocks from the disk when they are not in the
PostgreSQL buffer cache. Conveniently, PostgreSQL makes it very easy to
inspect the contents of the buffer cache. This is where the big
difference between uuidv4 and uuidv7 becomes clear. Because of the lack
of data locality in uuidv4 data, the primary key index is consuming a
huge amount of the buffer cache in order to support new data being
inserted - and this cache space is no longer available for other
indexes and tables, and this significantly slows down the entire
workload.
0 -
HTML [1]: https://ardentperf.com/2024/02/03/uuid-benchmark-war
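The easy inspection the quote mentions is the pg_buffercache
extension; e.g., to see which relations are hogging the cache:
    CREATE EXTENSION IF NOT EXISTS pg_buffercache;
    SELECT c.relname, count(*) AS buffers
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 10;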
mkleczek wrote 1 day ago:
That's really an important deficiency of Postgres.
A hash index is ideally suited for UUIDs, but for some reason Postgres
hash indexes cannot be unique.
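For reference, given a table docs with a uuid column id, this is where
Postgres stops you:
    CREATE INDEX ON docs USING hash (id);        -- fine: equality lookups
    CREATE UNIQUE INDEX ON docs USING hash (id);
    -- ERROR:  access method "hash" does not support unique indexes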
mexicocitinluez wrote 1 day ago:
You'll have to rip the ability to generate unique numbers from quite
literally anywhere in my app and save them without conflict from my
cold, dead hands.
The ability to know ahead of time what a primary key will be (in lieu
of persisting it first, then returning) opened up a whole new world of
architecting work in my app. It made a lot of previous awkward things
feel natural.
sgarland wrote 1 day ago:
Sounds like a lot of referential integrity violations.
mexicocitinluez wrote 1 day ago:
Why would generating a PK ahead of time cause referential integrity
violations? Super curious to find out.
sgarland wrote 1 day ago:
The implication is that you need to know the PK ahead of time so
that you can insert it into other tables which reference it as an
FK without waiting for it to be returned, which further implies
that you don't have FK constraints, because the DB would
disallow this.
Tbf in Postgres, you can declare FKs to be deferrable, so their
existence is checked at transaction commit, rather than at
insertion time.
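In SQL that looks like this (table names hypothetical):
    ALTER TABLE documents
        ADD CONSTRAINT documents_folder_id_fkey
        FOREIGN KEY (folder_id) REFERENCES folders (id)
        DEFERRABLE INITIALLY DEFERRED;  -- checked at COMMIT, not per row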
If you don't have the DB enforcing referential integrity, you
need to be extremely careful in your application logic; IME, this
inevitably fails. At some point, someone writes bad code, and you
get data anomalies.
mexicocitinluez wrote 23 hours 22 min ago:
> Tbf in Postgres, you can declare FKs to be deferrable, so
their existence is checked at transaction commit, rather than
at insertion time.
I'm using EF core which hooks up these relationships and allows
me to persist them in a single transaction using MSSQL server.
> If you don't have the DB enforcing referential integrity
I'm building an electronic medical system. I'm well aware of
the benefits of referential integrity.
old8man wrote 1 day ago:
Very useful article, thank you! Many people suggest CUID2, but it is
less efficient and is better used for frontend/url encoding. For
backend/db, only UUID v7 should be used.
kaladin_1 wrote 1 day ago:
I really hoped the author would discuss alternatives for distributed
databases that write in parallel. Sequential keys would be atrocious in
such circumstances; they could kill the whole gain of a distributed
database, as hotspots would inevitably appear.
I would like to hear from others using, for example, Google Spanner: do
you have issues with UUIDs? I don't for now; most optimizations happen
at the Controller level, and data transformation can be slow due to
validations. Try to keep service logic as straightforward as possible.
grugdev42 wrote 1 day ago:
A much simpler solution is to keep your tables as they are (with an
integer primary key), but add a non-sequential public identifier too.
id => 123,
public_id => 202cb962ac59075b964b07152d234b70
There are many ways to generate the public_id. A simple MD5 with a salt
works quite well for extremely low effort.
Add a unique constraint on that column (which also indexes it), and
you'll be safe and performant for hundreds of millions of rows!
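One low-effort way to wire that up (the salt literal is a placeholder;
any scheme mixing a secret into the hash works):
    CREATE TABLE items (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        public_id text NOT NULL UNIQUE
            DEFAULT md5('some-salt' || clock_timestamp()::text
                        || random()::text)
    );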
Why do we developers like to overcomplicate things? ;)
Retr0id wrote 1 day ago:
Per [1], if you use a block cipher rather than a hash then you don't
even need to store it anywhere.
HTML [1]: https://news.ycombinator.com/item?id=46273325
Denvercoder9 wrote 1 day ago:
This misses the point. The reason not to use UUIDv4 is that having an
index on random values is slow(er), because sequential inserts into
the underlying B-tree are faster than random inserts. You're hitting
the same problem with your `public_id` column, that it's not the
primary key doesn't change that.
sgarland wrote 1 day ago:
For InnoDB-based DBs that are not Aurora, and if the secondary
index isnât UNIQUE, it solves the problem, because secondary
non-unique index changes are buffered and written in batches to
amortize the random cost. If you're hashing a guaranteed-unique
entity, I'd argue you can skip the unique constraint on this
index.
For Aurora MySQL, it just makes it worse either way, since
there's no change buffer.
hnfong wrote 1 day ago:
Ints as pk would be quicker for joins etc though.
grugdev42 wrote 4 hours 31 min ago:
Exactly
hk1337 wrote 1 day ago:
My biggest thing with UUIDs is: don't UUID everything. Most things
should be okay with just regular integers as PKs.
BartjeD wrote 1 day ago:
Personally my approach has been to start with big-ints and add a GUID
code field if it becomes necessary. And then provide imports where you
can match objects based on their code, if you ever need to
import/export between tenants, with complex object relationships.
But that also adds complexity.
parpfish wrote 1 day ago:
Two things I don't like about big-int indexes:
- If you use uuids as foreign keys to another table, it's obvious
when you screw up a join condition by specifying the wrong indices.
With int indices you can easily get plausible-looking results because
your join will still return a bunch of data
- if you're debugging and need to search logs, having a simple uuid
string is nice for searching
ivan_gammel wrote 1 day ago:
This article is about a solution in search of a problem, a classic
premature optimization issue. UUIDv4 is perfectly fine for many use
cases, including small databases. The performance argument must be
considered when there's a performance problem on the horizon. Other
considerations may be, and very often are, superior to it.
sgarland wrote 1 day ago:
IME, when performance issues become obvious, the devs are in growth
mode and have no desire / time to revisit PK choice.
Integer PKs were seen as fine for years - decades, even - before the
rise of UUIDs.
sagarm wrote 1 day ago:
It's not really feasible to rekey your UUIDv4-keyed database to
int64s after the fact, imo. Sure, your new tables could be
integer-keyed, but the bulk of your storage will be UUID (and UUIDv4,
if that's what you started with) for a very long time.
mrkeen wrote 17 hours 0 min ago:
If you have too many UUIDs, throw more DBs at the problem.
Hopefully you haven't tied yourself to any architectural decisions
that would limit you to only using a single database.
ivan_gammel wrote 1 day ago:
Yes, sure. My point is, it may never be necessary.
sagarm wrote 23 hours 7 min ago:
I think you're right that it won't matter for most companies. But
having been at a company with persistent DB performance issues
with UUIDv4 keys as a contributing factor, it sucks.
vintermann wrote 1 day ago:
A prime example of premature optimization.
Permanent identifiers should not carry data. This is like the cardinal
sin of data management. You always run into situations where the thing
you thought "surely this never changes, so it's safe to squeeze into
the ID to save a lookup" changes anyway. Then people suddenly find out
they have a new gender identity, and they need a different final digit
in their ID numbers too.
Even if nothing changes, you can run into trouble. Norwegian PNs have
your birth date (in DDMMYY format) as the first six digits. Surely that
doesn't change, right? Well, wrong, since although the date doesn't
change, your knowledge of it might. Immigrants who didn't know their
exact date of birth got assigned 1. Jan by default... And then people
with actual birthdays on 1 Jan got told, "sorry, you can't have that as
birth date, we've run out of numbers in that series!"
Librarians in the analog age can be forgiven for cramming data into
their identifiers, to save a lookup. When the lookup is in a physical
card catalog, that's somewhat understandable (although you bet they
could run into trouble over it too). But when you have a powerful
database at your fingertips, use it! Don't make decisions you will
regret just to shave off a couple of milliseconds!
maxbond wrote 16 hours 33 min ago:
> Permanent identifiers should not carry data.
Do you have the same criticism for serial identifiers? How about
hashes? What about the version field in UUIDs?
danudey wrote 16 hours 36 min ago:
Perhaps you can clarify something for me, because I think I'm missing
it.
> Norwegian PNs have your birth date (in DDMMYY format) as the first
six digits
So presumably the format is DDMMYYXXXXX (for some arbitrary number of
X's), where the X's represent e.g. an automatically incrementing
number of some kind?
Which means that if it's DDMMYYXXX then you can only have 1000 people
born on DDMMYY, and if it's DDMMYYXXXXX then you can have 100,000
people born on DDMMYY.
So in order for there to be so many such entries in common that
people are denied use of their actual birthday, then one of the
following must be true:
1. The XXX counter must be extremely small, in order for it to run
out as a result of people 'using up' those Jan 1 dates each year
2. The number of people born on Jan 1 or immigrating to Norway
without knowledge of their birthday must be colossal
If it was just DDMMXXXXX (no year) then I can see how this system
would fall apart rapidly, but when you're dealing with specifically
"people born on Jan 1 2014 or who immigrated to Norway and didn't
know their birthday and were born on/around 2014 so that was the year
chosen" I'm not sure how that becomes a sufficiently large number to
cause these issues. Perhaps this only occurs in specific years where
huge numbers of poorly-documented refugees are accepted?
(Happy to be educated, as I must be missing something here)
PunchyHamster wrote 16 hours 43 min ago:
The cause is more just "not having enough bits". UUID is 128 bit.
You're not running out even if you use part for timestamp, the random
part will be big enough.
Like, it's a valid complaint.. just not for discussion at hand.
Also, we do live in reality, and while having an entirely random one
might be perfect from a data-theory standpoint, in reality having it
prefixed by a date has many performance advantages.
> Permanent identifiers should not carry data. This is like the
cardinal sin of data management
As long as you don't use the data and have actual fields for what's
also encoded in UUID, there is absolutely nothing wrong with it,
provided there is enough of the random part to get around artifacts
in real life data.
asah wrote 18 hours 3 min ago:
counterpoint: IRL, data values in a system like PostgreSQL are padded
to word boundaries so either you're wasting bits or "carrying data."
cycomanic wrote 21 hours 7 min ago:
Like the other poster said, this is a problem with default values, not
with encoding the birthday into the personnummer.
I think it also is important to remember the purpose of specific
numbers. For instance I would argue a PN without the birthday would
be strictly worse. With the current system (I only know the Swedish
one, but assume it's the same) I only have to remember a 4-digit number
(because the number is birth date + a unique 4 digits). If we would instead
use completely random numbers I would have to remember at least an 8
digit number (and likely to be future proof you'd want at least 9
digits). Sure that's fine for myself (although I suspect some people
already struggle with it), but then I also have to remember the
numbers for my 2 kids and my partner and things become quickly
annoying. Especially, because one doesn't use the numbers often
enough that it becomes easy, but still often enough that it becomes
annoying to look up, especially when one doesn't always carry their
phone with them.
guhcampos wrote 20 hours 43 min ago:
It's not that bad. Brazilian CPFs are 11 digits and everyone
remembers them. You just get used to it =)
GuB-42 wrote 21 hours 10 min ago:
I don't think the timestamped UUIDs are "carrying data", it is just a
heuristic to improve lookup performance. If the timestamp is wrong,
it will just run as slow as the non-timestamped UUID.
If you take the gender example, for 99% of people, it is male/female
and it won't change, and you can use that for load balancing. But if
later, you found out that the gender is not the one you expect for
that bucket, no big deal, it will cause a branch misprediction, but
instead of happening 50% of the times when you use a random value, it
will only happen 1% of the times, significant speedup with no loss in
functionality.
delecti wrote 18 hours 59 min ago:
As soon as you encode imperfect data in an immutable key, you
always have to check when you retrieve it. If that piece of data
isn't absolutely 100% guaranteed to be perfect, then you have to
query both halves of the load balanced DB anyway.
moralestapia wrote 20 hours 52 min ago:
>and you can use that for load balancing
As long as you're not in China or India around specific years ...
GP's point stands strong.
mzi wrote 21 hours 26 min ago:
> Even if nothing changes, you can run into trouble. Norwegian PNs
have your birth date (in DDMMYY format) as the first six digits.
Surely that doesn't change, right?
I guess that Norway has solved it in the same or similar way as
Sweden? So a person is identified by the PNR and for those systems
that need to track a person over several PNR (government agencies)
use PRI. And a PRI is just the first PNR assigned to a person with a
1 inserted in the middle. If that PRI is occupied, use a 2,and so on.
PRI could of course have been a UUID instead.
ralferoo wrote 22 hours 49 min ago:
> Norwegian PNs have your birth date (in DDMMYY format) as the first
six digits. Surely that doesn't change, right? Well, wrong, since
although the date doesn't change, your knowledge of it might.
Immigrants who didn't know their exact date of birth got assigned 1.
Jan by default... And then people with actual birthdays on 1 Jan got
told, "sorry, you can't have that as birth date, we've run out of
numbers in that series!"
To me, what your example really shows is the problem with incorrect
default values, not a problem with encoding data into a key per se.
If they'd chosen a non-date for unknown values, maybe 00 or 99 for
day or month components, then the issue you described would
disappear.
But in any case, the intention of encoding a timestamp into a UUID
isn't any implied meaning. It's to guarantee uniqueness, with a
side effect that IDs are more or less monotonically increasing.
Whether this is actually desirable depends on your application, but
generally if the application is as an indexed key for insertion into a
database, it's usually more useful for performance than a fully
random ID, as it avoids rewriting lots of leaf nodes of B-trees. If
you insert a load of such keys, they form a cluster on one side
of the tree that can then rebalance with only the top levels needing
to be rewritten.
PunchyHamster wrote 16 hours 47 min ago:
> To me, what your example really shows is the problem with
incorrect default values, not a problem with encoding data into a
key per se. If they'd chosen a non-date for unknown values, maybe
00 or 99 for day or month components, then the issue you described
would disappear.
well, till you run out of numbers for the immigrants that don't
have an exact birth date
croes wrote 20 hours 17 min ago:
> If they'd chosen a non-date for unknown values, maybe 00 or 99
for day or month components, then the issue you described would
disappear
> Norwegian PNs have your birth date (in DDMMYY format) as the
first six digits.
You can already feel the disaster rising because some program
always expects the latter.
And it doesn't fix the problem, it just makes it less likely.
nonethewiser wrote 20 hours 40 min ago:
>To me, what your example really shows is the problem with
incorrect default values, not a problem with encoding data into a
key per se. If they'd chosen a non-date for unknown values, maybe
00 or 99 for day or month components, then the issue you described
would disappear.
You still have that problem from organic birthdays and also the
problem of needing to change ids to correct birth dates.
jonny_eh wrote 18 hours 14 min ago:
To add to that, birthdays can clump, just like any seemingly
"random" data.
Dylan16807 wrote 16 hours 44 min ago:
Not significantly. For actual births, a couple of holidays have
very low rates, but no date clumps into a much higher rate.
lovich wrote 17 hours 29 min ago:
A million dots scattered randomly over a graph can all land on
the exact same coordinate if itâs truly random.
What most people intuit as random is some sort of noise
function that is generally dispersed and doesn't trigger the
pattern matching part of their brain.
Dylan16807 wrote 16 hours 41 min ago:
> A million dots scattered randomly over a graph can all land
on the exact same coordinate if itâs truly random.
It won't happen though. 0.00000000% chance it happens even
once in a trillion attempts.
> What most people intuit as random is some sort of noise
function that is generally dispersed and doesn't trigger
the pattern matching part of their brain
Yes, people intuit the texture of random wrong in a situation
where most buckets are empty. But when you have orders of
magnitude more events than buckets, that effect doesn't
apply. You get pretty even results that people expect.
lovich wrote 15 hours 35 min ago:
> It won't happen though. 0.00000000% chance it happens
even once in a trillion attempts.
It has the same odds as any other specific configuration of
randomly assigned dots. The overly active human pattern
matching behavior is the only reason it would be treated as
special.
FartinMowler wrote 13 hours 2 min ago:
Lol, reminds me of a story: at his workplace my brother
was invited to join a lottery ticket pool where each got
to pick the numbers for a ticket. The numbers he picked
were 1-2-3-4-5-6. Although the others, mostly fellow
engineers, reluctantly agreed his numbers were as likely
as the others, after a couple of weeks they neglected to
invite him again.
coldtea wrote 14 hours 29 min ago:
>It has the same odds as any other specific configuration
of randomly assigned dots
Which doesn't change anything in practice, since it
having "the same odds as any other specific
configuration" ignores the fact that more scattered
configurations are still far more numerous than it (or
than ones with more visual order in general) taken
all together.
>The overly active human pattern matching behavior is the
only reason it would be treated as special.
Nope, it's also the fact that it is ONE configuration,
whereas all the rest are a much, much larger number. That's
enough to make this specific configuration ultra rare in
comparison (since we don't compare it to the others one by
one but to all the others put together).
Dylan16807 wrote 14 hours 46 min ago:
Entropy says it's special. If you have a million dots
and 10,000 coordinates, you have 10,000 ways for all the
dots to land in the same coordinate, and a zillion
kavillion stupillion ways to have somewhere near 100 dots
in each coordinate.
lovich wrote 13 hours 11 min ago:
No, if its randomly distributed then every specific
configuration has the same exact chance of happening.
I am laughing at all the people coming out of the
woodwork to reply to my original post in this thread
misunderstanding randomness and chance.
If you flip a coin a million times and it lands on head
every single time, the millionth and 1 time still has a
50/50 chance of landing on heads
Dylan16807 wrote 12 hours 49 min ago:
> every specific configuration
Who said anything about specific configurations?
We started this talking about whether things "clump"
or not. The result depends on your definition of
"clump" but let's say it involves a standard
deviation. Different standard deviations have wildly
different probabilities, even when every specific
configuration has the same probability.
Nobody responding to you is calculating things wrong.
We're talking about the shape of the data.
Categories. And those categories are different
sizes, because they have different numbers of
specific configurations in them.
> the millionth and 1 time
I don't see any connection between the above
discussion and the gambler's fallacy?
OptionOfT wrote 21 hours 1 min ago:
Belgium's national register number is similar:
YY.MM.DD-AAA.BB
In either the AAA or BB component there is something about the
gender.
But it does mean that there is a limit on the number of people born per
day of a certain gender.
But for a given year, using a moniker will only delay the
inevitable. Sure, there are more numbers, but still limited as
there are SOME parts that need to reflect reality. Year, gender (if
that's still the case?) etc.
hyperman1 wrote 19 hours 38 min ago:
BB is a mod-97 checksum. The first A of AAA encodes your gender
in an even/odd fashion, I forgot if its the first or last A doing
that.
MM or DD can be 00 if unknown. Also MM has +20 or +40 in some
cases.
If you know someone's birth date and gender, the INSZ is almost
certainly 1 in 500 numbers, with a heavy skew to the lower AAA.
Luckily, you can't do much damage with someone's number, unlike a
USA SSN (but I'd still treat it as confidential).
notpushkin wrote 14 hours 7 min ago:
> I'd still treat it confidential
Estonian isikukood is GYYMMDDNNNC, and is relatively public.
You can find mine pretty easily if you know where to look (no
spoilers!). It's relatively harmless.
Kazakh IIN is YYMMDDNNNNNN (where N might have some structure)
and is similarly relatively public: e.g. if you're a sole
proprietor, chances are you have to hang your license on the
wall, which will have it.
It's a bit more serious: I've got my mail at the post
office by just showing a barcode of my IIN to the worker. They
usually scan it from an ID, which I don't have, but I've
figured out the format and created a .pkpass of my own. Zero
questions - here's your package, no we don't need your
passport either, have a nice day!
(Tangential, but Kazakhs also happen to have the most peculiar
post office layout: it looks exactly like a supermarket, where
you go in, find your packages (sorted by the tracking number,
IIRC), and go to checkout. I've never seen it anywhere else.)
tyre wrote 22 hours 37 min ago:
And then have to enter/handle a non-date through all systems? How
do you know if this non-dated person is over the age of majority?
Eligible for a pension?
Maybe the answer is to evenly spread the defaults over 365 days.
ralferoo wrote 22 hours 22 min ago:
If you don't know their birthday, you can presumably never answer
that question in any case.
If you only know the birth year and keyed 99 as the month for
unknown, then your algorithm would determine they were of a
correct age on the start of the year after that was true, which I
guess would be what you want for legal compliance.
If you don't even know if the birth year is correct, then the
correct process depends on policy. Maybe they choose any year,
maybe they choose the oldest/youngest year they might be, maybe
they just encode that as 0000/9999.
Again, if you don't know the birth year of someone, you would
have no way of knowing their age. I'm not sure that means that
the general policy of putting a birthday into their ID number is
flawed.
Many governments re-issue national IDs to the same person with
different numbers, which is far less problematic than the many
governments who choose to issue the same national ID (looking at
you, USA, with your SSN) to multiple individuals. It doesn't seem
like a massive imposition on a person who was originally issued
an ID without their birthday being known to be re-issued a
new ID when their birthday was ascertained. Perhaps even give
them a choice of keeping the old one knowing it will cause
problems, or take the new one instead and having the
responsibility to tell people their number had changed.
Presumably the governments that choose to embed the date into a
national ID number do so because it's more useful for their
purposes to do so than just assigning everyone a random number.
notpushkin wrote 14 hours 36 min ago:
> or take the new one instead and having the responsibility to
tell people their number had changed
Or have the opportunity to scam people into thinking you're a
different person. (E.g. take a $1M loan, go bankrupt, remember
your birthday, and take a loan again.)
benterix wrote 23 hours 42 min ago:
Your comment is valid but is not related to the article.
spoiler wrote 23 hours 13 min ago:
More broadly, this is the ages old surrogate vs natural key
discussion, but yes the comment completely misses the point of the
article. I can only assume they didn't read it in full!
vintermann wrote 23 hours 4 min ago:
The article explicitly argues against the use of GUIDs as primary
keys, and I'm arguing for it.
A running number also carries data. Before you know it, someone's
relying on the ordering or counting on there not being gaps - or
counting the gaps to figure out something they shouldn't.
michaelt wrote 18 hours 54 min ago:
> A running number also carries data. Before you know it,
someone's relying on the ordering or counting on there not
being gaps - or counting the gaps to figure out something they
shouldn't.
For example, if [1] can be seen but [2] can't, someone might
infer the existence of a hidden issue relating to a critical
security problem.
Whereas if the URL was instead [3] that risk would be avoided.
HTML [1]: https://github.com/pytorch/pytorch/issues/111111
HTML [2]: https://github.com/pytorch/pytorch/issues/111110
HTML [3]: https://github.com/pytorch/pytorch/issues/761500e0-007...
konart wrote 21 hours 37 min ago:
> Before you know it, someone's relying
Do not expose your internal IDs. As simple as that.
dpark wrote 21 hours 24 min ago:
This came up in the last two threads I read about uuidv7.
This is simply not a meaningful statement. Any ID you expose
externally is also an internal ID. Any ID you do not expose
is internal-only.
If you expose data in a repeatable way, you still have to
choose what IDs to expose, whether that's the primary key
or a secondary key. (In some cases you can avoid exposing
keys at all, but those are narrow cases.)
konart wrote 20 hours 55 min ago:
You have one ID as a primary key. It is used for building
relations in your database.
The second ID has nothing to do with internal structure of
your data. It is just another field.
You can change your structure however you want (or type of
your "internal" IDs) and you don't have to worry about an
external consumer. They still get their artificial ID.
dpark wrote 20 hours 43 min ago:
So what you meant is not to expose the primary key?
That's a more reasonable statement but I still don't
agree. This feels like one of those "best practices"
that people apply without thinking and create pointless
complexity.
Don't expose your primary key if there is a reason to
separate your primary key from the externally-exposed
key. If your primary key is the form that you want to
expose, then you should just expose the primary key. e.g.
If your primary key is a UUID, and you create a separate
UUID just to expose publicly, you have most likely added
useless complexity to your system.
whynotminot wrote 16 hours 28 min ago:
> create pointless complexity
My exact thought.
A lot else has failed in your system, from access
control to API design, if this becomes a problem.
Security by obscurity isn't the answer.
If the only thing between an attacker and your DB is
that they can't guess the IDs, you're already in
some serious trouble.
benterix wrote 22 hours 36 min ago:
> The article explicitly argues against the use of GUIDs as
primary keys, and I'm arguing for it.
Let's clarify things.
The author argues against UUIDv4 as primary keys when compared
to integers or bigints in large databases for performance
reasons.
The examples you give refer to the common mistake of using a
non-unique attribute that can be changed for a given entity as
a primary key.
scottlamb wrote 23 hours 46 min ago:
> Permanent identifiers should not carry data.
I think you're attacking a straw man. The article doesn't say
"instead of UUIDv4 primary keys, use keys such as birthdays with
exposed semantic meaning". On the contrary, they have a section about
how to use sequence numbers internally but obfuscated keys
externally. (Although I agree with dfox's and formerly_proven's
comments [1, 2] that the XOR method they proposed for this is terrible.
Reuse of a one-time pad is probably the most basic textbook example
of bad cryptography. They referred to the values as "obfuscated" so
they probably know this. They should have just gone with a better
method instead.) [1]
HTML [1]: https://news.ycombinator.com/item?id=46272985
HTML [2]: https://news.ycombinator.com/item?id=46273325
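For what a "better method" might look like: the Postgres wiki's
pseudo_encrypt is a small Feistel network over int4 - a
reversible, non-linear permutation rather than a reused pad. A
sketch, lightly adapted from the wiki (the constants are just its
arbitrary round keys, nothing sacred):
  -- Maps each int to a unique scrambled int; invertible by
  -- construction, so it can sit between the DB and the API.
  CREATE FUNCTION pseudo_encrypt(value int) RETURNS int AS $$
  DECLARE
    l1 int; l2 int; r1 int; r2 int; i int := 0;
  BEGIN
    l1 := (value >> 16) & 65535;
    r1 := value & 65535;
    WHILE i < 3 LOOP
      l2 := r1;
      r2 := l1 # ((((1366 * r1 + 150889) % 714025)
                   / 714025.0) * 32767)::int;
      l1 := l2; r1 := r2; i := i + 1;
    END LOOP;
    RETURN ((r1 << 16) + l1);
  END;
  $$ LANGUAGE plpgsql IMMUTABLE STRICT;
Still obfuscation rather than encryption, but unlike XOR with a
constant, two outputs no longer reveal the XOR-difference of
their inputs.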
patmorgan23 wrote 22 hours 12 min ago:
Insert order or time is information. And if you depend on that
information you are going to be really disappointed when back dated
records have to be inserted.
scottlamb wrote 21 hours 54 min ago:
Right, to ensure your clients don't depend on that information,
make the key opaque outside the database through methods such as
the ones dfox and formerly_proven suggested, as I said.
naasking wrote 23 hours 29 min ago:
I don't think the objection is that it exposes semantic meaning,
but that any meaningful information is contained within the key at
all, eg. even a UUID that includes timestamp information about when
it was generated is "bad" in a sense, as it leaks information.
Unique identifiers should be opaque and inherently meaningless.
scottlamb wrote 22 hours 49 min ago:
Your understanding is inconsistent with the examples in
vintermann's comment. Using a sequence number as an internal-only
surrogate key (deliberately opaqued when sent outside the bounds
of the database) is not the same as sticking gender identity,
birth date, or any natural properties of a book into a broadly
shared identifier.
naasking wrote 20 hours 43 min ago:
No it's not, they very explicitly clarify in follow-up comments
that unique identifiers should not have any kind of meaningful
content embedded. See: [1]
HTML [1]: https://news.ycombinator.com/item?id=46276995
HTML [2]: https://news.ycombinator.com/item?id=46273798
scottlamb wrote 20 hours 41 min ago:
Okay, but they ignore the stuff I was talking about,
consistent with my description of this as a straw man attack.
> A running number also carries data. Before you know it,
someone's relying on the ordering or counting on there not
being gaps - or counting the gaps to figure out something
they shouldn't.
The opaquing prevents that.
They also describe this as a "premature optimization". That's
half-right: it's an optimization. Having the data to support
an optimization, and focusing on optimizing things that are
hard to migrate later, is not premature.
sgarland wrote 1 day ago:
> Permanent identifiers should not carry data.
Did you read the article? He doesn't recommend natural keys, he
recommends integer-based surrogates.
> A prime example of premature optimization.
Disagree. Data is sticky, and PKs especially so. Moreover, if
you're going to spend time optimizing anything early on, it should
be your data model.
> Don't make decisions you will regret just to shave off a couple of
milliseconds!
A bad PK in some databases (InnoDB engine, SQL Server if clustered)
can cause query times to go from sub-msec to tens of msec quite
easily, especially with cloud solutions where storage isn't
node-local. I don't just mean a UUID; a BIGINT PK on a 1:M can
destroy your latency for the simple reason of needing to fetch a
separate page for every record. If instead the PK is a composite of
(parent_id, id) - e.g. (user_id, id) - where id is a monotonic
integer, you'll have WAY better data locality.
Postgres suffers a different but similar problem with its visibility
map lookups.
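A minimal Postgres sketch of that layout (table and column names
invented here, not taken from the article):
  -- Rows for one user sort together in the PK index, so an
  -- "all events for user X" scan touches far fewer pages.
  CREATE TABLE user_events (
      user_id bigint NOT NULL,
      id      bigint GENERATED ALWAYS AS IDENTITY,
      payload text,
      PRIMARY KEY (user_id, id)
  );
Caveat: a plain Postgres heap is not clustered by the PK, so the
locality win there is mostly in the index itself; in InnoDB or a
clustered SQL Server table the row data is physically ordered
this way too.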
liuliu wrote 20 hours 10 min ago:
I read it (and regret it; it was a waste of my time). Their
arguments are:
* integer keys are faster;
* uuidv7 keys are faster than v4;
* if you want obfuscated keys, use integers and do your own
obfuscation (!!!).
I can get on board with uuidv7 (with the trade-off, of course, on
stronger guessability). The integer keys argument is strange. At
that point, you need to come up with a custom-built system to
avoid id collisions in a distributed system, and it achieves only
a 2x saving (the absolute minimum you should do is 64-bit keys).
Very puzzling suggestion and to me very wrong.
Note that in this entire article, the recommendation is not about
using natural keys (email address, some composite of user
identification etc.), so I am skipping that whole discussion.
sgarland wrote 15 hours 55 min ago:
You can hand out chunks of sequential ids from a central
coordinator to avoid collision; this is a well-established
pattern.
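In Postgres that coordinator can be as simple as a sequence that
steps by a whole block - a sketch, names invented here; each
nextval() hands the caller 1000 ids to assign locally:
  CREATE SEQUENCE id_block_seq INCREMENT BY 1000;
  -- The caller owns [block_start, block_start + 999].
  SELECT nextval('id_block_seq') AS block_start;
This is essentially the hi/lo allocation pattern ORMs have used
for years.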
Re: natural keys (or something like it), I was using it as an
example of how badly PK choice can impact performance at scale.
PunchyHamster wrote 16 hours 33 min ago:
> (with the trade-off, of course, on stronger guessability).
you're not guessing a random number with 2^72 possibilities. And
if guessing a UUID does something in your app, you already fucked up
lukeschlather wrote 23 hours 19 min ago:
> Did you read the article? He doesn't recommend natural keys, he
recommends integer-based surrogates.
I am not a cryptographer, but I would want his recommendation
reviewed by a cryptographer. And then I would have to implement it.
UUIDs have been extensively reviewed by cryptographers, I have a
variety of excellent implementations I can use, I know they solve
the problem well. I know they can cause performance issues; they're
a security feature that is easy to implement, and I can deal with
the performance issues if and when they crop up. (Which, in my
experience, is unusual. Even at a large company, most databases I
encounter do not have enough data. I will err on the side of
security until it becomes a problem, which is a good problem to
have.)
alerighi wrote 22 hours 53 min ago:
Why would they be a security feature? They are not; the article
even says it. Even if UUIDv4s are random, nobody guarantees that
they are generated with a cryptographically secure random number
generator, and in fact most implementations aren't!
The reason you use a UUID in a lot of contexts is when you have a
distributed system where you want your client to decide the ID
that is then stored in multiple systems that don't communicate.
This is surely a valid scenario for a random UUID.
To me the rule is: use a UUID as a customer-facing ID for things
that have to have an identity (e.g. a user, an order, etc.) and
expose it publicly through APIs; use integer IDs as internal
identifiers that are used to create relations between entities,
and keep internal IDs private. That way the more efficient
numeric IDs remain inside the database and are used for joining
data, while the UUID is used only for accessing the object from
an API (for example); internally, when joining (where you have to
deal with a lot of rows), you can use the more efficient numeric
ID.
By the way, I think the practice of "using UUIDs" came from NoSQL
databases, where you surely use a UUID, but you also don't have
to join data. People then transposed a best practice from one
scenario to SQL, where it's not really best practice...
lukeschlather wrote 22 hours 17 min ago:
If a sequential ID is exposed to the client, the client can
trivially use it to determine the number of records and the
relative age of any records. UUID solves this, and the use of a
cryptographically secure number generator isn't really
necessary for it to solve this. The author's scheme might be
similarly effective, but I trust UUIDs to work well. There are
obviously varying ways to hide this information other than
UUIDs, but UUIDs are simple and I don't have to think about it,
I just get the security benefits. I don't have to worry about
not exposing IDs to the clients, I can do it freely.
sgarland wrote 19 hours 42 min ago:
I have never seen anyone post an actual example of the German
Tank problem creating an issue for them, only that it's
possible.
> I don't have to think about it
And here we have the main problem of most DB issues I deal
with on a daily basis - someone didn't want to think about
the implications of what they were doing, and it suddenly
becomes my emergency because they have no idea how to address
it.
mrkeen wrote 10 hours 30 min ago:
Here you go: [1] [2025-12-10]
> We generate every valid 7-digit North American phone
number, then for every area code, send every number in
batches of 40000
> Time to go do something else for a while. Just over 27
hours and one ill-fated attempt at early season ski touring
later, the script has finished happily, the logfile is full
of entries, and no request has failed or taken longer than
3 seconds. So much for rate limiting. We've leaked every
Freedom Chat user's phone number
HTML [1]: https://news.ycombinator.com/item?id=46279123
lukeschlather wrote 16 hours 27 min ago:
If you can predict user IDs this is extremely useful when
you're trying to come up with an exploit that might create
a privileged user, or perhaps you can create some object
you have access to that is owned by users that will be
created in the near future.
When I say "I don't have to think about it" I mean I don't
have to think about the ways an attacker might be able to
predict information about my user ids which they could use
to gain access to accounts, because I know they cannot
predict information about user ids.
You are dismissing the implications of using something that
is less secure than UUIDs and you haven't convinced me I'm
the one failing to think through the implications. I know
there are performance problems, I know they might require
some creative solutions. I am not worried about
unpredictable performance issues, I am worried about
unpredictable security problems.
sgarland wrote 15 hours 58 min ago:
Perhaps this is my bias coming through. I work with DBs
day in and day out, and the main problem I face is
performance from poorly-designed schemas and queries;
next largest issue is referential integrity violations
causing undefined behavior. The security issues I've
found were all people doing absurdly basic stuff, like
exposing an endpoint that dumped passwords.
To me, if you're relying on having a matching PK as
security, something has already gone wrong. There are
ways to provide AuthN and AuthZ other than that. And yes,
"defense in depth," but if your base layer is "we
have unguessable user ids," IME people will become
complacent, and break it somewhere else in the stack.
barrkel wrote 1 day ago:
Uuid v7 just has a bias in its generation; it isn't carrying
information. You're not going to try and extract a timestamp from a
uuid.
Random vs time biased uuids are not a decision to shave off ms that
you will regret.
Most likely they will be a decision that shaves off seconds (yes,
really - especially when you consider locality effects) and you'll
regret nothing.
ncruces wrote 18 hours 8 min ago:
> You're not going to try and extract a timestamp from a uuid.
So, a random library: [1]
> Time returns the time in 100s of nanoseconds since 15 Oct 1582
encoded in uuid. The time is only defined for version 1, 2, 6 and
7 UUIDs.
HTML [1]: https://pkg.go.dev/github.com/google/uuid#UUID.Time
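Postgres is heading the same way: if I'm reading the release
notes right, Postgres 18 ships uuid_extract_timestamp(), which
makes the extraction a one-liner (table/column names invented):
  -- NULL unless the uuid version embeds a timestamp.
  SELECT uuid_extract_timestamp(id) FROM events LIMIT 1;
Once that's in the box, Hyrum's Law says someone will build on it.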
duckerude wrote 22 hours 39 min ago:
I've worked on a system where ULIDs (not UUIDv7, but similar)
were used with a cursor to fetch data in chronological order and
then - surprise! - one day records had to be backdated, meaning
that either the IDs for those records had to be counterfeited
(potentially violating invariants elsewhere) or the fetching had
to be made smarter.
You can choose to never make use of that property. But it's
tempting.
voidnap wrote 20 hours 52 min ago:
I made a service using something like a 64-bit wide ULID, but
there was never a presumption that data would be inserted or
updated earlier than the most recent record.
If the domain is modeling something like external events (in my
case), and that external timestamp is packed into your primary
key, and you support receiving events out of chronological order,
then it just follows that you might insert stuff earlier than
your latest record.
You're gonna have problems "backdating" if you mix up time of
insertion with when the event you model actually occurred. Like
if you treat those as the same thing when they aren't.
jandrewrogers wrote 23 hours 46 min ago:
> You're not going to try and extract a timestamp from a uuid.
Hyrum's Law suggests that someone will.
tobyhinloopen wrote 1 day ago:
> You're not going to try and extract a timestamp from a uuid.
I totally used uuidv7s as "inserted at" in a small project and I
had methods to find records created between two timestamps that
literally converted timestamps to uuidv7 values so I could do
"WHERE id BETWEEN a AND b"
bri3d wrote 1 day ago:
> You're not going to try and extract a timestamp from a uuid.
What? The first 48 bits of a UUIDv7 are a Unix timestamp in milliseconds.
Whether or not this is a meaningful problem or a benefit to any
particular use of UUIDs requires thinking about it; in some cases
it's not to be taken lightly and in others it doesn't matter at
all.
I see what you're getting at, that ignoring the timestamp aspect
makes them "just better UUIDs," but this ignores security
implications and the temptation to partition by high bits
(timestamp).
nine_k wrote 23 hours 25 min ago:
Nobody forces you to use a real Unix timestamp. BTW the original
Unix timestamp is 32 bits (expiring in 2038), and now everyone is
switching to 64-bit time_t. What 48 bits?
All you need is a guaranteed non-decreasing 48-bit number. A
clock is one way to generate it, but I don't see why a UUIDv7
would become invalid if your clock is biased, runs too fast, too
slow, or whatever. I would not count on the first 48 bits being a
"real" timestamp.
bri3d wrote 22 hours 47 min ago:
> Nobody forces you to use a real Unix timestamp.
Besides the UUIDv7 specification, that is? Otherwise you have
some arbitrary kind of UUID.
> I would not count on the first 48 bits being a "real"
timestamp.
I agree; this is the existential hazard under discussion which
comes from encoding something that might or might not be data
into an opaque identifier.
I personally don't agree as dogmatically with the grandparent
post that extraneous data should _not_ be incorporated into
primary key identifiers, but I also disagree that "just use
UUIDv7 and treat UUIDs as opaque" is a completely plausible
solution either.
sroussey wrote 17 hours 57 min ago:
That is like the HTML specification -- nobody ever puts up a
web page that is not conformant. ;p
The idea behind putting some time as prefix was for btree
efficiency, but lots of people use client side generation and
you can't trust it, and it should not matter because it is
just an id, not a way of registering time.
nine_k wrote 20 hours 45 min ago:
I mean, any 32-bit unsigned integer is a valid Unix timestamp
up until 19 January 2038, and, by extension, any u64 is, too,
for a far longer time.
The only promise of Unix timestamps is that they never go
back, always increase. This is a property of a sequence of
UUIDs, not any particular instance. At most, one might argue
that an "utterly valid" UUIDv7 should not contain a timestamp
from the far future. But I don't see why it can't be any time in
the past, as long as the timestamp part does not decrease.
The timestamp aspect may be a part of an additional interface
agreement: e.g. "we guarantee that this value is UUIDv7 with
the timestamp in UTC, no more than a second off". But I
assume that most sane engineers won't offer such a guarantee.
The useful guarantee is the non-decreasing nature of the
prefix, which allows for sorting.
oblio wrote 1 day ago:
> Well, wrong, since although the date doesn't change.
Someone should have told Julius Caesar and Gregory XIII that :-p
hnfong wrote 1 day ago:
The curious thing about the article is that, it's definitely
premature optimization for smaller databases, but when the database
gets to the scale where these optimizations start to matter, you
actually don't want to do what they suggest.
Specifically, if your database is small, the performance impact is
probably not very noticeable. And if your database is large (eg. to
the extent primary keys can't fit within 32-bit int), then you're
actually going to have to think about sharding and making the system
more distributed... and that's where UUID works better than
auto-incrementing ints.
scottlamb wrote 19 hours 3 min ago:
I agree there's a scale below which this (or any) optimization
doesn't matter, and a scale above which you want your primary key to have
locality (in terms of which shard/tablet/... is responsible for the
record). But...
* I think there is a wide range in the middle where your database
can fit on one machine if you do it well, but it's worth optimizing
to use a cheaper machine and/or extend the time until you need to
switch to a distributed db. You might hit this middle range soon
enough (and/or it might be a painful enough transition) that it's
worth thinking about it ahead of time.
* If/when you do switch to a distributed database, you don't always
need to rekey everything:
** You can spread existing keys across shards via hashing on lookup
or reversing bits. Some databases (e.g. DynamoDB) actually force
this.
** Allocating new ids in the old way could be a big problem, but
there are ways out. You might be able to switch allocation schemes
entirely without clients noticing if your external keys are
sufficiently opaque. If you went with UUIDv7 (which addresses some
but not all of the article's points), you can just keep using it.
If you want to keep using dense(-ish), (mostly-)sequential bigints,
you can amortize the latency by reserving blocks at a time.
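For the hashing-on-lookup option, a sketch using one of
Postgres's built-in hash functions (the shard count and table
name are placeholders):
  -- Spread existing sequential bigint keys over 16 shards
  -- without rekeying anything.
  SELECT abs(hashint8(id)::bigint) % 16 AS shard
  FROM accounts WHERE id = 12345;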
hyperpape wrote 1 day ago:
Your comment is sufficiently generic that it's impossible to tell
what specific part of the article you're agreeing with, disagreeing
with, or expanding upon.
vintermann wrote 1 day ago:
I disagree that performance should be a reason to choose running
numbers over guids until you absolutely have to.
I think IDs should not carry information. Yes, that also means I
think UUIDv7 was wrong to squeeze a creation date into their ID.
Isn't that clear enough?
hxtk wrote 21 hours 22 min ago:
When I think "premature optimization," I think of things like
making a tradeoff in favor of performance without justification.
It could be a sacrifice of readability by writing uglier but more
optimized code that's difficult to understand, or spending time
researching the optimal write pattern for a database that I could
spend developing other things.
I don't think I should ignore what I already know and
intentionally pessimize the first draft in the name of avoiding
premature optimization.
hyperpape wrote 1 day ago:
Those are two unrelated points and the connection between them
was unclear in the original post.
anamexis wrote 1 day ago:
For what it's worth, it was also completely unclear to me how
you were responding to the article itself. It does not discuss
natural keys at all.
barrkel wrote 1 day ago:
UUID v7 doesn't squeeze creation date in. If you treat it as
anything other than a random sequence in your applications,
you're just wrong.
zamadatix wrote 1 day ago:
"What it does" and "what I think you should do with it" should
not be treated as equivalent statements.
mcny wrote 1 day ago:
That's the creation date of that guid though. It doesn't say
anything about the entity in question. For example, you might be
born in 1987 and yet only get a social security number in 2007
for whatever reason.
So, the fact that there is a date in the uuidv7 does not extend
any meaning or significance to the record outside of the
database.
To infer such a relationship where none exists is the error.
kube-system wrote 1 day ago:
The time component either has meaning and it should be in its
own column, or it doesn't have meaning and it is unnecessary
and shouldn't be there at all.
I'm not a normalization fanatic, but we're only talking about
1NF here.
vintermann wrote 1 day ago:
You can argue that, but then what is its purpose? Why should
anyone care about the creation date of a by-design completely
arbitrary thing?
I bet people will extract that date and use it, and it's hard
to imagine use which wouldn't be abuse. To take the example of
a PN/SSN and the usual gender bit: do you really want anyone to
be able to tell that you got a new ID at that time? What could
you suspect if a person born in 1987 got a new PN/SSN around
2022?
Leaks like that, bypassing whatever access control you have in
your database, are just one reason to use real random IDs. But
it's even a pretty good one in itself.
kentm wrote 9 hours 30 min ago:
> You can argue that, but then what is its purpose?
The purpose is to reduce randomness while still preserving the
probability of uniqueness. UUIDv4s come with performance
issues when used to bucket data for updates, such as when
they're used as primary keys in a database.
A database like MySQL or PostgreSQL has sequential ids and
you'd use those instead, but if you're writing something
like Iceberg tables using Trino/Spark/etc. then being able to
generate unique ids (without using a data store) that tend to
be clustered together is useful.
majorchord wrote 1 day ago:
> You can argue that, but then what is its purpose? Why
should anyone care about the creation date of a by-design
completely arbitrary thing?
Pretty sure sorting and filtering them by date/time range in
a database is the purpose.
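Concretely, a time-range filter can ride the PK index by packing
the target instant into the leading 48 bits and zeroing the rest.
A rough Postgres sketch, assuming the ids really are v7, with
function/table/column names made up:
  -- Lowest possible UUIDv7 for a given instant.
  CREATE FUNCTION uuid7_floor(ts timestamptz) RETURNS uuid AS $$
    SELECT (lpad(to_hex((extract(epoch FROM ts) * 1000)::bigint),
                 12, '0') || '70008000000000000000')::uuid;
  $$ LANGUAGE sql IMMUTABLE;
  SELECT * FROM events
  WHERE id BETWEEN uuid7_floor('2025-12-01')
               AND uuid7_floor('2025-12-02');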
dpark wrote 21 hours 17 min ago:
That is absolutely not the purpose. The specific purpose of
uuidv7 is to optimize for B-Tree characteristics, not so
you can craft queries based on the IDs being sequential.
This assumption that you can query across IDs is exactly
what is being cautioned against. As soon as you do that,
you are taking a dependency on an implementation detail.
The contract is that you get a UUID, not that you get 48
bits of timestamp. There are 8 different UUID types and
even v7 has more than one variant.
kentm wrote 9 hours 28 min ago:
B-trees too but also bucketing for formats like delta
lake or iceberg, where having ids that cluster will
reduce the number of files you need to update.
miroljub wrote 1 day ago:
If you need sorting and filtering by date, just add a
timestamp to your table instead of misusing an Id column
for that.
majorchord wrote 23 hours 21 min ago:
> just
It is easy to have strong opinions about things you are
sheltered from the consequences of.
mixmastamyk wrote 23 hours 23 min ago:
That happens, in general. The benefit comes when it's
time to look up by uuid only; the prefix is an index to
its disk block location.
dpark wrote 21 hours 15 min ago:
> the prefix is an index to its disk block location
What? This is definitely not the case and can't be
because B-tree nodes change while UUIDs do not.
mixmastamyk wrote 21 hours 14 min ago:
I didn't mean that literally, but it's no longer
editable. It was supposed to have "like" etc. in
there.
dpark wrote 20 hours 59 min ago:
But UUIDv7 doesn't change that at all. It
doesn't matter what flavor of UUID you choose.
The ID is always "like" an index to a block in
that you traverse the tree to find the node. What
UUIDv7 does is improve some performance
characteristics when creating new entries and
potentially for caching.
naasking wrote 23 hours 33 min ago:
Exactly, be explicit, don't shoehorn multiple purposes
into a single column that's supposed to be a largely
meaningless unique identifier.
anamexis wrote 1 day ago:
I would argue that is one of very few situations where
leaking the timestamp at which the ID was created, when you
already have the ID, is a possible concern at all.
And when working with very large datasets, there are very
significant downsides to large, completely random IDs (which
is of course what the OP is about).
mcny wrote 1 day ago:
> What could you suspect if a person born in 1987 got a new
PN/SSN around 2022?
Thank you for spelling it out for me.
For the readers: it leaks information that the person is likely
not a natural-born citizen. The assumption doesn't have to be a
hundred percent accurate; there is a way to make that assumption
and possibly hold it against you. And there are probably a
million ways that a record-creation date could be held against
you. If they don't put it in writing, how will you prove they
discriminated against you?
Thinking... I don't have a good answer to this. If data
exists, people will extract meaning from it, whether rightly
or not.
infogulch wrote 1 day ago:
To quote the great Mr Sparrow:
> The only rules that really matter are these: what a man
can do and what a man can't do.
When evaluating security matters, it's better to strip off
the moral valence entirely ("rightly") and only consider
what is possible given the data available.
Another potential concerning implication besides
citizenship status: a person changed their id when put in a
witness protection program.
mkleczek wrote 1 day ago:
This is actually a very deep and interesting topic.
Stripping information from an identifier disconnects a piece of data
from the real world which means we no longer can match them. But such
connection is the sole purpose of keeping the data in the first
place. So, what happens next is that the real world tries to adjust
and the "data-less" identifier becomes a real world artifact. The
situation becomes the same but worse (eg. you don't exist if you
don't remember your social security id). In extreme cases people are
tattooed with their numbers.
The solution is not to come up with yet another artificial identifier
but to come up with better means of identification taking into
account the fact that things change.
PunchyHamster wrote 16 hours 36 min ago:
An identifier is just a piece of common token a system can use
to operate on the same entity.
You need it, because it's maybe the one lone unchangeable thing.
Taking a person for example:
* date of birth can be changed, if there was an error and a
correction in documents
* any and nearly all existing physical characteristics can change
over time, either due to brain things (deciding to change gender),
aging, or accidents (fingerprints no longer apply if you burnt your
skin enough)
* DNA might be good enough, but that's one fucking long identifier
to share and one hard to validate in the field.
So a unique ID attached to a few other attributes to identify the
current iteration of an individual is the best we have, and the
best we will get.
brettgriffin wrote 21 hours 23 min ago:
> Stripping information from an identifier disconnects a piece of
data from the real world which means we no longer can match them.
But such connection is the sole purpose of keeping the data in the
first place.
The surrogate key's purpose isn't to directly store the natural
key's information, rather, it's to provide an index to it.
> The solution is not to come up with yet another artificial
identifier but to come up with better means of identification
taking into account the fact that things change.
There isn't 'another' - there's just one. The surrogate key. The
other pieces of information you're describing are not the means of
indexing the data. They are the pieces of data you wish to
retrieve.
mkleczek wrote 20 hours 36 min ago:
Any piece of information that can be used to retrieve something
using this index has to be available "outside" your database -
ie. to issue a query "give me piece of information identified by
X" you have to know X first. If X is only available in your index
then you must have another index to retrieve X based on some
externally available piece of information Y. And then X becomes
useless as an identifier - it just adds a level of indirection
that does not solve any information retrieval problem.
That's my whole point: either X becomes a "real world artifact"
or it is useless as identifier.
brettgriffin wrote 1 hour 24 min ago:
That's not really how data is requested. Most of these
identifiers are foreign keys - they exist in a larger object
graph. Most systems of record are too large for people to
associate surrogate keys with anything meaningful - they can
easily have hundreds of billions of records.
Rather, users traverse that object graph, narrowing a range of
keys of interest.
This Hacker News article was given a surrogate key, 46272487.
From that, you can determine what it links to, the
name/date/author of the submission, comments, etc.
46272487 means absolutely nothing to anybody involved. But if
you wanted to see submissions from user pil0u, or submissions
on 2025-12-15, or submissions pertaining to UUID, 46272487
would be in that result set. Once 46272487 joins out to all of
its other tables, you can populate a list that includes the
user name, title, domain, etc.
Do not encode identifying information in unique identifiers!
The entire world of software is built on surrogate keys and
they work wonderfully.
everforward wrote 1 day ago:
> The solution is not to come up with yet another artificial
identifier but to come up with better means of identification
taking into account the fact that things change.
I think artificial and data-less identifiers are the better means
of identification that takes into account that things change. They
don't have to be the identifier you present to the world, but
having them is very useful.
E.g. phone numbers are semi-common identifiers now, but phone
numbers change owners for reasons outside of your control. If you
use them as an internal identifier, changing them between accounts
gets very messy because now you don't have an identifier for the
person who used to have that phone number.
It's much cleaner and easier to adapt if each person gets an
internal context-less identifier and you use their phone number to
convert from their external ID/phone number to an internal ID. The
old account still has an identifier, there's just no external
identifier that translates to it. Likewise if you have to change
your identifier scheme, you can have multiple external IDs that
translate to the same internal ID (i.e. you can resolve both their
old ID and their new ID to the same internal ID without insanity in
the schema).
mkleczek wrote 22 hours 38 min ago:
> I think artificial and data-less identifiers are the better
means of identification that takes into account that things
change. They don't have to be the identifier you present to the
world, but having them is very useful.
If the only reason you need a surrogate key is to introduce
indirection in your internal database design then sequence
numbers are enough. There is no need to use UUIDs.
The whole discussion is about externally visible identifiers (ie.
identifiers visible to external software, potentially used as a
persistent long-term reference to your data).
> E.g. phone numbers are semi-common identifiers now, but phone
numbers change owners for reasons outside of your control. If you
use them as an internal identifier, changing them between
accounts gets very messy because now you don't have an identifier
for the person who used to have that phone number.
Introducing surrogate keys (regardless of whether UUIDs or
anything else) does not solve any problem in reality. When I come
to you and say "My name is X, this is my phone number, this is my
e-mail, I want my GDPR records deleted", you still need to be
able to find all data that is related to me. Surrogate keys don't
help here at all. You either have to be able to solve this issue
in the database or you need to have an oracle (ie. a person) that
must decide ad-hoc what piece of data is identified by the
information I provided.
The key issue here is that you try to model identifiable
"entities" in your data model, while it is much better to model
"captured information".
So in your example there is no "person" identified by "phone
number" but rather "at timestamp X we captured information about
a person at the time named Y and using phone number Z".
Once you start thinking about your database as structured storage
of facts that you can use to infer conclusions, there is much
less need for surrogate keys.
dpark wrote 19 hours 6 min ago:
> So in your example there is no "person" identified by "phone
number" but rather "at timestamp X we captured information
about a person at the time named Y and using phone number Z".
Once you start thinking about your database as structured
storage of facts that you can use to infer conclusions, there
is much less need for surrogate keys.
This is so needlessly complex that you contradicted yourself
immediately. You claim there is no "person" identified but
immediately say you have information "about a person". The
fact that you can assert that the information is about a person
means that you have identified a person.
Clearly tying data to the person makes things so much easier. I
feel like attempting to do what you propose is begging to mess
up GDPR erasure.
> "So I got a request from a John Doe to erase all data we
recorded for them. They identified themselves by mailing
address and current phone number. So we deleted all data we
recorded for that phone number."
> "Did you delete data recorded for their previous phone
number?"
> "Uh, what?"
The stubborn refusal to create a persistent identifier makes
your job harder, not easier.
everforward wrote 21 hours 55 min ago:
> If the only reason you need a surrogate key is to introduce
indirection in your internal database design then sequence
numbers are enough. There is no need to use UUIDs.
The UUID would be an example of an external key (for e.g.
preventing crawling keys being easy). This article mentions a
few reasons why you may later decide there are better external
keys.
> When I come to you and say "My name is X, this is my phone
number, this is my e-mail, I want my GDPR records deleted", you
still need to be able to find all data that is related to me.
How are you going to trace all those records if the requester
has changed their name, phone number and email since they
signed up if you don't have a surrogate key? All 3 of those
are pretty routine to change. I've changed my email and phone
number a few times, and if I got married my name might change
as well.
> Once you start thinking about your database as structured
storage of facts that you can use to infer conclusions, there
is much less need for surrogate keys.
I think that spirals into way more complexity than you're
thinking. You get those timestamped records about "we got info
about person named Y with phone number Z", and then person Y
changes their phone number. Now you're going to start getting
records from person named Y with phone number A, but it's the
same account. You can record "person named Y changed their
phone number from Z to A", and now your queries have to be
temporal (i.e. know when that person had what phone number).
You could back-update all the records to change Z to A, but
that breaks some things (e.g. SMS logs will show that you sent
a text to a number that you didn't send it to).
Worse yet, neither names nor phone numbers uniquely identify a
person, so it's entirely possible to have records saying
"person named Y and phone number Z" that refer to different
people if a phone number transfers from a John Doe to a
different person named John Doe.
I don't doubt you could do it, but I can't imagine it being
worth it. I can't imagine a way to do it that doesn't either
a) break records by backdating information that wasn't true
back then, or b) require repeated/recursive querying that will
hammer the DB (e.g. if someone has had 5 phone numbers, how do
you get all the numbers they've had without pulling the latest
one to find the last change, and then the one before that, and
etc). Those queries are incredibly simple with surrogate keys:
"SELECT * FROM phone_number_changes WHERE user_id = blah".
mkleczek wrote 20 hours 55 min ago:
> The UUID would be an example of an external key (for e.g.
preventing crawling keys being easy). This article mentions a
few reasons why you may later decide there are better
external keys.
So we are talking about "external" keys (ie. visible outside
the database). We are back to square one: externally visible
surrogate keys are problematic because they are detached from
real world information they are supposed to identify and
hence don't really identify anything (see my example about
GDPR).
It does not matter if they are random or not.
> How are you going to trace all those records if the
requester has changed their name, phone number and email
since they signed up if you don't have a surrogate key?
And how does a surrogate key help? I don't know the surrogate
key that identifies my records in your database.
Even if you use them internally, it is an implementation
detail.
If you keep information about the time information was
captured, you can at least ask me "what was your phone number
the last time we interacted, and when was that?"
> I think that spirals into way more complexity than you're
thinking.
This complexity is there whether you want it or not and
you're not going to eliminate it with surrogate keys. It has
to be explicitly taken care of.
DBMSes provide means to tackle this essential complexity:
bi-temporal extensions, views, materialized views etc.
Event sourcing is a somewhat convoluted way to attack this
problem as well.
> Those queries are incredibly simple with surrogate keys:
"SELECT * FROM phone_number_changes WHERE user_id = blah".
Sure, but those queries are useless if you just don't know
user_id.
dpark wrote 18 hours 50 min ago:
> externally visible surrogate keys are problematic because
they are detached from real world information they are
supposed to identify and hence don't really identify
anything (see my example about GDPR).
All IDs are detached from the real world. That's the core
premise of an ID. It's a bit of information that is
unique to someone or something, but it is not that person
or thing.
Your phone number is a random number that the phone company
points to your phone. Your house has a street name and
number that someone decided to assign to it. Your email is
an arbitrary label that is used to route mail to some
server. Your social security number is some arbitrary id
the government assigned you. Even your name is an arbitrary
label that your parents assigned to you.
Fundamentally your notion that there is some "real
world" identifier is not true. No identifiers are real.
They are all abstractions, and the question is not whether
the "real" identifier is better than a "fake" one,
but whether an existing identifier is better than one you
create for your system.
I would argue that in most cases, creating your own ID is
going to save you headaches in the long term. If you bake
SSN or Email or Phone Number throughout your system, you
will make it a pain for yourself when inevitably someone
needs to change their ID and you have cascading updates
needed throughout your entire system.
halffullbrain wrote 19 hours 57 min ago:
In my country, citizens have an "ID" (a UUID, which most
people don't know the value of!) and a social security
number which they know (which has all the problems
described above).
While the social security number may indeed change (doubly
assigned numbers, gender reassignment, etc.), the ID
needn't change, since it's the same physical person.
Public sector it-systems may use the ID and rely on it not
changing.
Private sector it-systems can't look up people by their ID,
but only use the social security number for comparisons and
lookups, e.g. for wiping records in GDPR "right to be
forgotten"-situations. Social security numbers are
sortof-useful for that purpose because they are printed on
passports, driver's licenses and the like. And they are a
problem w.r.t. identity theft, and shouldn't ever be used
as an authenticator (we have better methods for that).
The person ID isn't useful for identity theft, since it's
only used between authorized contexts (disregarding
Byzantine scenarios with rogue public-sector actors!). You
can't social engineer your way to personal data using that
ID (save a few movie-plot scenarios).
So what is internal in this case? The person id is indeed
internal to the public sector's it-systems, and useful for
tracking information between agencies. They're not useful
for Bob or Alice. (They ARE useful for Eve, or other
malicious inside actors, but that's a different story,
which realistically does require a much higher level of
digital maturity across the entire society)
Ukv wrote 1 day ago:
> Stripping information from an identifier disconnects a piece of
data from the real world which means we no longer can match them.
But such connection is the sole purpose of keeping the data in the
first place.
The identifier is still connected to the user's data, just through
the appropriate other fields in the table as opposed to embedded
into the identifier itself.
> So, what happens next is that the real world tries to adjust and
the "data-less" identifier becomes a real world artifact. The
situation becomes the same but worse (eg. you don't exist if you
don't remember your social security id). In extreme cases people
are tattooed with their numbers.
Using a random UUID as primary key does not mean users have to
memorize that UUID. In fact in most cases I don't think there's
much reason for it to even be exposed to the user at all.
You can still look up their data from their current email or phone
number, for instance. Indexes are not limited to the primary key.
> The solution is not to come up with yet another artificial
identifier but to come up with better means of identification
taking into account the fact that things change.
A fully random primary key takes into account that things change -
since it's not embedding any real-world information. That said I
also don't think there's much issue with embedding creation time in
the UUID for performance reasons, as the article is suggesting.
marcus_holmes wrote 11 hours 6 min ago:
> You can still look up their data from their current email or
phone number, for instance. Indexes are not limited to the
primary key.
This is the key point, I think. Searching is not the same as
identifying.
mkleczek wrote 22 hours 59 min ago:
> Using a random UUID as primary key does not mean users have to
memorize that UUID. In fact in most cases I don't think there's
much reason for it to even be exposed to the user at all.
So what is such an identifier for? Is it only for some technical
purposes (like replication etc.)?
Why bother with UUID at all then for internal identifiers?
Sequence number should be enough.
Ukv wrote 22 hours 19 min ago:
UUIDs are good for creating entries concurrently where
coordinating between distributed systems may be difficult.
May also be that you don't want to leak information like how
many orders are being made, as could be inferred from a
`/fetch_order?id=123` API with sequential IDs.
Sequential primary keys are still commonly used though - it's a
scenario-dependent trade-off.
mkleczek wrote 21 hours 57 min ago:
If you expose the identifier outside the database, it is no
longer "internal".
Ukv wrote 21 hours 8 min ago:
Given the chain was:
> > Using a random UUID as primary key does not mean users
have to memorize that UUID. [...]
> So what is such an identifier for? [...] Why bother with
UUID at all then for internal identifiers?
The context, that you're questioning what they're useful
for if not for use by the user, suggests that "internal"
means the complement. That is, IDs used by your company and
software, and maybe even API calls the website makes, but
not anything the user has to know.
Otherwise, if "internal" was intended to mean something
stricter (only used by a single non-distributed database,
not accessed by any applications using the database, and
never will be in the future), then my response is just that
many IDs are neither internal in this sense nor intended to
be memorized/saved by the user.
sethhochberg wrote 22 hours 24 min ago:
"Internal" is a blurry boundary, though - you pick integer
sequence numbers and then, years on, an API gets bolted onto
your purely internal database and now your system is vulnerable
to enumeration attacks. Does a vendor system where you
reference some of your internal data count as "internal"? Is
UID 1 the system user that was originally used to provision the
system? Better try and attack that one specifically... the list
goes on.
UUIDs or other similarly randomized IDs are useful because they
don't include any ordering information or imply anything about
significance, which is a very safe default despite the
performance hits.
There certainly are reasons to avoid them and the article we're
commenting on names some good ones, at scale. But I'd argue
that if you have those problems you likely have the resources
and experience to mitigate the risks, and that true
randomly-derived IDs are a safer default for most new systems
if you don't have one of the very specific reasons to avoid
them.
mkleczek wrote 21 hours 54 min ago:
> "Internal" is a blurry boundary, though
Not for me :)
"Internal" means "not exposed outside the database" (that
includes applications and any other external systems)
demurgos wrote 21 hours 20 min ago:
Internal means "not exposed outside some boundary". For
most people, this boundary encompasses something larger
than a single database, and this boundary can change.
vrighter wrote 1 day ago:
You can't take into account the fact that things change when you
don't know what those changes might be. You might end up needing to
either rebuild a new database, have some painful migration, or
support two codepaths to work with both types of keys.
mkleczek wrote 23 hours 2 min ago:
Network protocol designers know better and by default embed
protocol version number in message format spec.
I guess you can assign 3-4 bits for identifier version number as
well.
And yes - for long living data dealing with compatibility issues
is inevitable so you have to take that into account from the very
beginning.
vrighter wrote 9 hours 48 min ago:
When I designed network protocols this is exactly what I did. I
also did so in file formats I had to create. But a database
primary key is not somewhere where that can be easily done.
groundzeros2015 wrote 1 day ago:
You can't design something by trying to anticipate all future
changes. Things will change and break.
In my personal design sense, I have found that keeping generality
at bay actually helps my code last longer (based on more concrete
ideas) and makes it easier to change when those days come.
dpark wrote 22 hours 33 min ago:
In my experience, virtually every time I bake concrete data
into identifiers I end up regretting it. This isn't a case of
trying to predict all possible future changes. It's a case of
trying to not repeat the exact same mistake again.
groundzeros2015 wrote 19 hours 26 min ago:
I don't disagree with that; I'm disagreeing with this
comment that we can't make protocol or data decisions that
might change.
dpark wrote 19 hours 3 min ago:
I misunderstood then. I interpreted your comment to say
that you eschew generalization (e.g. uuids) in favor of
concrete data (e.g. names, email addresses) for ids in your
designs.
tacone wrote 1 day ago:
Fantastic real-life example. Italian PNs also carry the gender,
which is something you can change surgically, and you'll eventually
run into the issue when operating at scale.
I don't agree with the absolute statement, though. Permanent
identifiers should not generally carry data. There are situations
where you want to have a way to reconcile, or you have space or
speed constraints, so you may accept the trade-off, md5 your data
and store it in a primary index as a UUID. Your index will fragment
and thus you will vacuum, but life will still be good overall.
cozyman wrote 23 hours 33 min ago:
how does one change their gender surgically?
delichon wrote 22 hours 25 min ago:
You can't, but since gender isn't defined by anything physical,
there's no need.
bigstrat2003 wrote 21 hours 21 min ago:
That is only true if you're using an extremely idiosyncratic
definition of gender. As far as 95% of English speakers are
concerned, gender is defined by the body you possess.
defrost wrote 17 hours 18 min ago:
As far as nigh on 100% of Bugis speakers are concerned there
have always been five genders, and they'll tell you the words
in their language they have for them.
* [1] It appears to be a cultural construct.
HTML [1]: https://en.wikipedia.org/wiki/Buginese_language
EnergyAmy wrote 11 hours 33 min ago:
You and the other person are probably talking past each
other. For most people, "gender" is merely the polite way
of saying "sex", and that's probably what the other
commenter was referring to.
Gender in the sense of "the social roles and norms on top
of biological sex" is indeed a construct, though heavily
informed by the biology that they're based on. Biological
sex is very much real and not a construct.
defrost wrote 10 hours 50 min ago:
Of course biological sex is real and strongly bimodal
with outliers, who ever said otherwise?
EnergyAmy wrote 9 hours 53 min ago:
Technically correct, but to be specific sex is binary,
not merely bimodal. Sex is entirely defined by gametes,
and is binary in anisogamous species such as humans.
Isogamous species don't have sexes, they have mating
types (and often many thousands of them).
There's actually an ideological movement to try to
redefine sex based on sex traits instead of gametes,
but this ends up being incoherent and useless for the
field of biology. Biologists have had to publish papers
explaining the fundamentals of their field to counter
the ideological narrative:
Why There Are Exactly Two Sexes. [1] That's why I
thought it was worth mentioning. Many people are
confused because of the culture wars. To bring it back
around to the general topic of this thread, it's fine
to store someone's sex as a boolean, because sex is
binary and immutable. Storing cultural constructs like
gender as anything other than an arbitrary string is
asking for trouble, though.
HTML [1]: https://link.springer.com/article/10.1007/s105...
defrost wrote 3 hours 56 min ago:
Reproductive sex is determined by gametes .. sure.
Not all humans are born with the attribute of
reproductive sex via gametes.
Hence "biological sex is real and strongly bimodal
with outliers" (in humans; it gets odder elsewhere
in animal life on earth). It's just not all
reproductive sex, nor is it all strictly M or
strictly F, despite it mostly being one or the other.
> To bring it back around to the general topic of
this thread, it's fine to store someone's sex as a
boolean, because sex is binary and immutable.
Not in Australia, via a decision that ascended
through all levels of the national court system, nor
is sex, as you've chosen to define it ("entirely
defined by gametes") binary.
Biology is truly messy. It's understandable not
everybody truly grasps this.
Colin Wright is pretty much a prop up cardboard
"scientist" for the Manhattan Institute (a political
conservative think tank).
I tend to run with people with actual field
credentials doing real biology and medicine; Michael
Alpers, Fiona Stanley, Fiona Wood, et al were my
influences.
If Colin Wright scratches your itch for bad biology
then by all means run with the one hit wonder who
reinforces a preconception untroubled by empiricism.
EnergyAmy wrote 1 hour 44 min ago:
You can't legislate reality away. If you're
tracking biological sex, then it doesn't matter
what a court decides. If you're tracking legal
fictions then you might.
I look forward to your citation disputing the truth
of what he lays out in that paper. In the meantime,
feel free to peruse the list here of people
affirming the same stance: [1] Or someone else: [2]
You should ask the people you run with why no human
is born with a body not organized around the
production of gametes. You'll notice that when you
read about conditions like anorchia or ovarian
agenesis, the sex of the person with that condition
is not a mystery, it's literally in the name.
Biology is messy indeed, and that's why finding
such a universal definition was so useful.
HTML [1]: https://projectnettie.wordpress.com/
HTML [2]: https://www.nas.org/academic-questions/33/...
lovich wrote 17 hours 25 min ago:
The only real states of matter are solids, liquids, and
gases. Everything else is just woke lunacy.
I am confident in this fact because I learned it in
elementary school decades ago and it is impossible for
humanity to discover new information that updates our world
model. Every English speaker knows that "plasmas" and
"Bose-Einstein condensates" are made up.
brigandish wrote 14 hours 47 min ago:
We all await your Nobel for finding a third type of gamete.
lovich wrote 13 hours 15 min ago:
The person I was responding to was talking about gender,
but if you want to talk about biology then see [1] [2] [3]
[4] [5] [6] [7]. I assume you will be one of the advocates
for my Nobel Prize.
edit: I'm sorry, you specifically mentioned gametes; we
can talk about diploids and haploids if you wish, and how
our bodies are such complicated machines that any sort of
error that can occur in our growth is guaranteed to occur
at scale
HTML [1]: https://en.wikipedia.org/wiki/Intersex#Prevalenc...
HTML [2]: https://en.wikipedia.org/wiki/Klinefelter_syndro...
HTML [3]: https://en.wikipedia.org/wiki/XXYY_syndrome
HTML [4]: https://en.wikipedia.org/wiki/XXXY_syndrome
HTML [5]: https://en.wikipedia.org/wiki/XXXYY_syndrome
HTML [6]: https://en.wikipedia.org/wiki/XXXXY_syndrome
HTML [7]: https://en.wikipedia.org/wiki/Trisomy_X
brigandish wrote 10 hours 41 min ago:
Not only have you undermined your claim to a Nobel
award by showing a spurious understanding of biology,
you wrote, quite sarcastically "it is impossible for
humanity to discover new information that updates our
world model". Well then, we will all await your
discovery of that 3rd gamete, or some theory so
innovative that it tips this well studied, well
understood, uncontested (by any valid competitor) model
to the wayside and humanity can revel in this new
information, the better model of reality that you
promise.
While you're at it, you could tell us all what the
scientific discovery was that made gender separate from
sex, who found it and when, and what the defining
difference is. Did they win a Nobel for that?
I request that in any reply, you refrain from spamming
me with Wikipedia links to articles you don't
understand and probably haven't read.
EnergyAmy wrote 11 hours 54 min ago:
XXY/etc are all variations within a sex. The above
poster is correct to point out that sex is defined
entirely by the gamete size that one's body is
organized around producing in anisogamous species like
humans, and is binary.
Intersex is a misleading term, the better term is [1] .
There are male DSDs and female DSDs. Even in the case
of ovotestes, you'll have one gamete produced, and the
other tissue will be nonfunctional.
HTML [1]: https://en.wikipedia.org/wiki/Disorders_of_sex...
lovich wrote 11 hours 23 min ago:
And yet, the original person I was responding to
spoke about gender.
If you are going to step into this argument, please
do not move the goalposts
edit: I've triggered the HN censor bot, so editing to
apologize to EnergyAmy, they are correct on their
point. I am still going to throw back at brigandish
that they moved the goalposts
EnergyAmy wrote 11 hours 18 min ago:
I'm responding specifically to your comment in
regards to "but if you want to talk about biology
then" followed by a list of biological variations
that don't dispute the sex binary. The goalposts
are exactly where you've left them.
Terr_ wrote 21 hours 0 min ago:
Does that mean hundreds of years of English-speakers
referring to sailing ships as "she" were all part of a
conspiracy to hide that ships have jiggly bits? :p
pezezin wrote 4 hours 19 min ago:
Wait until you find gendered languages (like most languages
in Europe) and realize that grammatical gender usually
doesn't have anything to do with biological sex :P
WalterSlovotsky wrote 22 hours 48 min ago:
The preferred method would be gender affirming surgery.
cozyman wrote 22 hours 19 min ago:
affirming would mean the change has already taken place
BobaFloutist wrote 19 hours 47 min ago:
Right, because it has. The change in gender identity (or in
choosing to make said identity more public) has already taken
place, and the surgery seems to affirm that.
mckirk wrote 1 day ago:
I'm not sure whether that was intended, but 'operating at scale'
actually made me laugh out loud :D
benterix wrote 22 hours 34 min ago:
I have to admit an unintended chuckle, too.
oncallthrow wrote 1 day ago:
It sounds to me like you're just arguing for premature optimization
of another kind (specifically, prematurely changing your entire
architecture for edge cases that probably won't ever happen to
you).
vintermann wrote 1 day ago:
If you have an architecture already, obviously it's hard to change
and you may want to postpone it until those edge cases which
probably won't ever happen to you, happen. But for new
architectures, value your own grey hairs over small performance
improvements.
raxxorraxor wrote 1 day ago:
> Do not assume that UUIDs are hard to guess; they should not be used
as security capabilities
The issue is that this is true for more or less all capability URLs. I
wouldn't recommend UUIDs per se here; it's probably better to just use a
random number. I have seen UUIDs used for this in practice, though, and
those systems weren't compromised because of it.
I hate the tendency of password recovery flows, for example, to leave
the URL valid for only 5 minutes. Of course these URLs need to have a
limited lifetime, but mail isn't a real-time communication medium. There
is very little security benefit in reducing it from 30 minutes to 5
minutes. You are not getting "securer" this way.
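For illustration, a plain random token with a saner expiry might look
like this (a minimal sketch; assumes the pgcrypto extension, and the
table and column names are made up):
  CREATE EXTENSION IF NOT EXISTS pgcrypto;
  -- 32 random bytes, hex-encoded: 256 bits of entropy, nothing to enumerate
  CREATE TABLE password_resets (
      token      text PRIMARY KEY DEFAULT encode(gen_random_bytes(32), 'hex'),
      user_id    bigint NOT NULL,
      expires_at timestamptz NOT NULL DEFAULT now() + interval '30 minutes'
  );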
K0nserv wrote 1 day ago:
An additional thing I learned when I worked on a ulid alternative over
the weekend[0]: Postgres's internal Datum type is at most 64 bits,
which means every uuid requires heap allocation[1] (at least until we
get 128-bit machines).
0: [1] 1: [2]
HTML [1]: https://bsky.app/profile/hugotunius.se/post/3m7wvfokrus2g
HTML [2]: https://github.com/postgres/postgres/blob/master/src/backend/u...
qntmfred wrote 20 hours 1 min ago:
you may be interested in this postgres extension as well
HTML [1]: https://github.com/blitss/typeid-postgres
K0nserv wrote 15 hours 47 min ago:
That is indeed interesting.
I have slightly different goals for my version. I want everything
to fit in 128 bits so I'm sacrificing some of the random bits, I'm
also making sure the representation inside Postgres is also exactly
128 bits. My initial version ended up using CBOR encoding and being
160 bits.
Mine dedicates 16 bits for the prefix allowing up to 3 characters
(a-z alphabet).
reactordev wrote 1 day ago:
A fun trick I did was generating UUID-like ids. We can all identify a
UUIDv4 most of the time by looking at one. "Ah, a uuid," we say to
ourselves. A little over a decade ago I was working on a massive cloud
platform, and rather than generate string keys like the author above
suggested (int -> binary -> base62 str), we opted for a more "clever"
approach.
The UUID is 128 bits. The first 64 bits are a java long. The last 64 bits
are another java long. Let's just combine the Tenant ID long with a
Resource ID long to generate a unique id for this on our platform.
(Worked until it didn't.)
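In Postgres terms the scheme looked roughly like this (a reconstruction
of the idea, not the actual code; the two values are made up):
  -- pack two 64-bit values into one uuid-shaped 128-bit id
  SELECT encode(int8send(1001) || int8send(42), 'hex')::uuid;
  -- tenant id = 1001, resource id = 42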
nrhrjrjrjtntbt wrote 1 day ago:
Atlassian settles on longer "ARIs" for this (e.g. [1]), composed of
guids, which allow a scheme like the Amazon ARN to be passed around.
HTML [1]: https://developer.atlassian.com/cloud/guard-detect/developer...
reactordev wrote 1 day ago:
yeah, the problem for us was the resource id. What id was it? Was
it a post? an upload? a workspace? it wasn't nearly as descriptive
as we needed it to be.
cebert wrote 1 day ago:
Using UUIDs as primary keys in non-relational databases like DynamoDB
is valid and doesn't raise the concerns mentioned in the article.
andatki wrote 1 day ago:
Good point; the post should make clear that it's referring only
to my experience with Postgres.
bux93 wrote 1 day ago:
Long article about why not to use UUIDv4 as primary keys, but... who is
doing so? And why are they doing that? How would you solve their
requirements? Just throwing out "you can use UUIDv7" doesn't help with,
e.g., the size they take up.
Aren't people using (big)ints as primary keys, and using UUIDs as
logical keys for import/export, solving portability across different
machines?
andatki wrote 15 hours 23 min ago:
This was written based on working as a consultant on several Postgres
databases at "medium"-sized companies that had excessive IO and
latency and used UUID v4 PKs/FKs. They're definitely out there. We
could transform the schema for some key tables as a demonstration with
big int equivalents and show the IO latency reduction. With that said,
a real-world PK data type migration is costly, so whether to do it
becomes a business decision.
Sayrus wrote 1 day ago:
UUIDs are usually the go-to solution to enumeration problems. The
space is large enough that an attacker cannot guess how many X you
have (invoices, users, accounts, organizations, ...). When people
replace the ints with UUIDv4s, they keep them as primary keys.
bruce511 wrote 1 day ago:
I'd add that it's also used when data is created in multiple
places.
Consider say weather hardware. 5 stations all feeding into a
central database. They're all creating rows and uploading them.
Using sequential integers for that is unnecessarily complex (if
even possible).
Given the amount of data created on phones and tablets, this
affects more situations than first assumed.
It's also very helpful in export / edit / update situations. If I
export a subset of the data (let's say to Excel), the user can edit
all the other columns and I can safely import the result. With
integer they might change the ID field (which would be bad). With
uuid they can change it, but I can ignore that row (or the whole
file) because what they changed it to will be invalid.
nrhrjrjrjtntbt wrote 1 day ago:
Yes, and the DB might be columnar or a distributed KV,
sidestepping the index problem.
dfox wrote 1 day ago:
> Creating obfuscated values using integers
While that is often a neat solution, do not do it by simply XORing the
numbers with a constant. Use a block cipher in ECB mode (if you want the
ID to be short, something like NSA's Speck comes in handy here, as it
can be instantiated with a 32- or 48-bit block).
And do not even think about using RC4 for that (I've seen it multiple
times), because that is completely equivalent to XORing with a constant.
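A rough sketch of the idea using pgcrypto, which ships AES but not
Speck, so the obfuscated value grows to 128 bits (key and id are made
up):
  CREATE EXTENSION IF NOT EXISTS pgcrypto;
  -- ECB over a single block is just a keyed permutation of the id space
  SELECT encode(encrypt(int8send(42), 'sixteen byte key', 'aes-ecb'), 'base64');
  -- decrypt() with the same key and mode inverts the mapping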
socketcluster wrote 1 day ago:
My advice is: Avoid Blanket Statements About Any Technology.
I'm tired of midwit arguments like "Tech X is N% faster than tech Y at
performing operation Z. Since your system (sometimes) performs
operation Z, it implies that Tech X is the only logical choice in all
situations!"
It's an infuriatingly silly argument because operation Z may only
represent about 10% of the total CPU usage of the whole system
(averaged out)... So what is promoted as a 50% gain may in fact be a 5%
gain when you consider it in the grand scheme of things... Negligible.
If everyone was looking at this performance 'advantage' rationally,
nobody would think it's worth sacrificing important security or
operational properties.
I don't know what happened to our industry; we're supposed to be
intelligent people but I see developers falling for these obvious
logical fallacies over and over.
I remember back in my day, one of the senior engineers was discussing
upgrading a python system and stated openly that the new version of the
engine was something like 40% slower than the old version but he didn't
even have to explain himself why upgrading was still a good decision;
everybody in the company knew he was only talking about the code
execution speed and everybody knew that this was a small fraction of
the total.
Not saying UUIDv7 was a bad choice for Postgres. I'm sure it's fine for
a lot of situations but you don't have to start a cult preaching the
gospel of The One True UUID to justify your favorite project's
decisions.
I do find it kind of sly though how the community decided to make this
UUIDv7 instead of creating a new standard for it.
The whole point of UUID was to leverage the properties of randomness to
generate unique IDs without requiring coordination. UUIDv7 seems to
take things down a philosophically different path. People chose UUID for
scalability and simplicity (both of which you get as a result of doing
away with the coordination overhead), not for raw performance...
That's the other thing which drives me nuts; people who don't
understand the difference between performance and scalability. People
foolishly equate scalability with parallelism or concurrency; whereas
that's just one aspect of it; scalability is a much broader topic. It's
the difference between a theoretical system which is fast given a
certain artificially small input size and one which actually performs
better as the input size grows.
Lastly, no mention is made of the complex logic that has to take
place behind the scenes to generate UUIDv7 IDs... People take it for
granted that all computers have a clock that can produce accurate
timestamps and that all computers in the world are magically in sync...
UUIDv7 is not simple; it's very complicated. It has a lot of additional
complexity and dependencies compared to UUIDv4. Just because that
complexity is very well hidden from most developers, doesn't mean it's
not there and that it's not a dependency... This may become especially
obvious as we move to a world of robotics and embedded systems where
cheap microchips may not have enough Flash memory to hold the code for
the kinds of programs required to compute such elaborate IDs.
thraxil wrote 1 day ago:
Yep. We have tables that use UUIDv4 that have 60M+ rows and don't
have any performance problems with them. Would some queries be faster
using something else? Probably, but again, for us it's not close to
being a bottleneck. If it becomes a problem at 600M or 6B rows, we'll
deal with it then. We'll probably switch to UUIDv7 at some point, but
it's not a priority and we'll do some tests on our data first. Does
my experience mean you should use UUIDv4? No. Understand your own
system and evaluate how the tradeoffs apply to you.
ezekg wrote 21 hours 48 min ago:
I have tables that have billions of rows that use UUIDv4 primary
keys and I haven't encountered any issues either. I do use UUIDv7
for write-heavy tables, but even then, I got a way bigger
performance boost from batching inserts than switching from UUIDv4
to UUIDv7. Issue is way overblown.
p2detar wrote 23 hours 7 min ago:
Nice feedback. Out of curiosity, have you done any fine-tuning of
Postgres that greatly improved performance?
thraxil wrote 6 hours 15 min ago:
Nope. Out of the box GCP Cloud SQL instance.
kunley wrote 1 day ago:
Wasn't choosing uuids as ids falling for the deceptive argument in
the first place?
christophilus wrote 1 day ago:
Not really, no. They're very convenient for certain problems and
work really well in general. I've never had a performance issue
where the problem boiled down to my use of UUID.
danparsonson wrote 1 day ago:
You never having seen the problem doesn't mean it never happens;
I have dealt with a serious performance problem in the past that
was caused by excessive page fragmentation from a GUID PK.
To your original point, these are heuristics; there isn't always
time to dig into every little architectural decision, so having a
set of rules of thumb on hand helps to preempt problems at
minimal cognitive cost. "Avoid using a GUID as a primary key if
you can" is one of mine.
kunley wrote 1 day ago:
What are these certain problems, if I may ask?
socketcluster wrote 1 day ago:
A major one for me is preventing duplicate records.
If the client POSTs a new object to insert into the
database, and there is a connection failure such that the client
does not receive a success response from the server, the client
cannot know whether the record was inserted or not without
making an expensive and cumbersome additional read call to
check... The client cannot simply assume that the insertion did
not happen purely on the basis that they did not receive a
success response. It could very well be that the insertion
succeeded but the connection failed shortly after, so the
response was not received. If the IDs are auto-incremented on the server
and the client posts the same object again without any ID on
it, the server will create a duplicate record in the database
table (same object with a different ID).
On the other hand, if the client generates a UUID for the
object it wants to create on the front-end, then it can safely
resend that exact object any number of times and there is no
risk of double-insertion; the object will be rejected the
second time and you can show the user a meaningful error
"Record was already created" instead of creating two of the
same resource; leading to potential bugs and confusion.
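A minimal sketch of that flow on the Postgres side (hypothetical
table; the uuid is any client-generated v4 value):
  CREATE TABLE orders (
      id      uuid PRIMARY KEY,  -- generated by the client, not the server
      payload jsonb NOT NULL
  );
  -- safe to replay after a lost response: the second attempt is a no-op
  INSERT INTO orders (id, payload)
  VALUES ('2f9a2a3e-5c3f-4d6e-9b1a-8a2d3c4e5f60', '{"item": 1}')
  ON CONFLICT (id) DO NOTHING;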
kunley wrote 1 day ago:
Ehm.. so you're saying that INSERT ... RETURNING id is not
atomic from the client's pov because something terrible could
happen just when client is receiving the answer inside its
SQL driver?
socketcluster wrote 1 day ago:
I'm actually more thinking about the client sitting on the
front-end like a single page app. Network instability could
cause the response to not reach the front-end after a
successful insert. This wouldn't be extremely common but
would definitely be a problem for you as the database admin
if you have above a certain number of users. I've seen this
issue on live production systems and the root cause of
duplicate records can be baffling because of how
infrequently it may happen. Tends to cause issues that are
hard to debug.
kunley wrote 21 hours 36 min ago:
OK got it. I was thinking about SQL client, not about
client of a REST service. With that distinction in mind,
the reasoning makes sense; thank you.
mkleczek wrote 1 day ago:
Preferably, you would design your APIs and services to be
idempotent (i.e. use PUT not POST, etc.).
Using an idempotency identifier is a last resort in my book.
dpark wrote 19 hours 59 min ago:
How is URI not an idempotency identifier for PUT?
socketcluster wrote 1 day ago:
Still, UUID is probably the simplest and most reliable way
to generate such idempotency identifiers.
scary-size wrote 1 day ago:
This reminds me of this old gist for generating Firebase-like "push
IDs" [1]. Those have some nicer properties.
HTML [1]: https://gist.github.com/mikelehen/3596a30bd69384624c11
Lucasoato wrote 1 day ago:
Hi, a question for you folks. What if I don't want to embed a timestamp
in the uuid, as v7 does? That could expose you to timing attacks in
specific scenarios.
Also, is it necessary to show uuids at all to customers of an API? Or
could it be a valid pattern to hide all the querying complexity behind
named identifiers, even if it could cost a bit in terms of joining and
indexing?
The context is the classic B2B SaaS, but feel free to share your
experiences even if they come from other scenarios!
lwhi wrote 1 day ago:
Wouldn't you need to expose UUID if you want to make use of
optimistic locking?
Lucasoato wrote 1 day ago:
I feel that this is among the good reasons to keep exposing UUID in
the API.
dimitrisnl wrote 1 day ago:
Noob question, but why not use ints for the PK, and UUIDs for a
public_id field?
scotty79 wrote 17 hours 19 min ago:
One of the benefits of UUIDs is that you can easily merge data coming
from multiple databases. Auto-increments cause collisions.
dsego wrote 1 day ago:
I also think we can use a combination of a PID - persistent ID (I
always thought it was public) and an auto-increment integer ID.
Having a unique key helps when migrating data between systems or
referencing a piece of data in a different system. Also, using serial
IDs in URLs and APIs can reveal sensitive information, e.g. how many
items there are in the database.
alerighi wrote 1 day ago:
If you put an index on the UUID field (because you have an API where
you can retrieve objects by UUID) you have kind of the same problem,
at least in Postgres, where a primary key index and a secondary index
are more or less the same (to the point that it's perfectly valid in
pgsql to not define any primary key for a table, because storage on
disk is keyed by an internal ID, and the indexes, primary or not, just
reference that row ID). Plus there's the wasted space of having 2
indexes for the same table.
Of course this is not always bad: for example, if you have a lot of
relations, you can have only one table with the UUID field (and thus
the expensive index), and the relations can use the more efficient int
key (say, a user entity with both int and uuid keys, where a user
attribute references the user by the int key, at the expense of a join
if you need to retrieve one user attribute without retrieving the
user).
torginus wrote 1 day ago:
You can create hash indexes in Postgres, so a secondary uuid index
seems workable:
HTML [1]: https://www.postgresql.org/docs/current/hash-index.html
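For example (hypothetical table; note that hash indexes only support
equality lookups and can't back a UNIQUE constraint):
  CREATE TABLE users (
      id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      public_id uuid NOT NULL DEFAULT gen_random_uuid()
  );
  CREATE INDEX users_public_id_idx ON users USING hash (public_id);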
grim_io wrote 1 day ago:
The article mentions microservices, which can increase the likelihood
of collisions in sequential incremental keys.
One more reason to stay away from microservices, if possible.
mrkeen wrote 1 day ago:
The 'collision' is two service classes both trying to use one db.
If you separate them (i.e. microservices) then they no longer try to
use one db.
grim_io wrote 1 day ago:
There is nothing stopping multiple microservices from using the
same DB, so of course this will happen in practice.
Sometimes it might even be for a good reason.
bardsore wrote 1 day ago:
Always try to avoid having two services using the same DB. Only way
I'd ever consider sharing a DB is if only one service will ever
modify it and all others only read.
grim_io wrote 1 day ago:
Good luck enforcing that :)
edding4500 wrote 1 day ago:
*edit: sorry, misread that. My answer is not valid for your question.
original answer: because if you don't come up with these ints randomly,
they are sequential, which can cause many unwanted situations where
people can guess valid IDs and deduce things from that data. See
HTML [1]: https://en.wikipedia.org/wiki/German_tank_problem
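For reference, the standard minimum-variance unbiased estimator from
that article, with k observed serial ids and m the largest one seen:
  \hat{N} = m\left(1 + \frac{1}{k}\right) - 1
so even a handful of leaked sequential ids yields a good estimate of
the total count.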
javaunsafe2019 wrote 1 day ago:
So we make things hard in the backend because of leaky
abstractions? Doesn't make sense imo.
jcims wrote 1 day ago:
Decades of security vulnerabilities and compromises because of
sequential/guessable PKs is (only!) part of the reason we're
here. Miss an authorization check anywhere in the application
and you're spoon-feeding entire tables to anyone with the
inclination to ask for it.
javawizard wrote 1 day ago:
Hence the presumed implication behind the public_id field in GP's
comment: anywhere identifiers are exposed, you use the public_id
field, thereby preventing ID guessing while still retaining the
benefits of ordered IDs where internal lookups are concerned.
Edit: just saw your edit, sounds like we're on the same page!
mcny wrote 1 day ago:
PostgreSQL 18 was released in September and has uuidv7
HTML [1]: https://www.postgresql.org/docs/current/functions-uuid.html
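A minimal sketch on 18 (table name made up):
  CREATE TABLE events (
      id uuid PRIMARY KEY DEFAULT uuidv7()
  );
  -- note the generation time is recoverable from the value
  SELECT uuid_extract_timestamp(uuidv7());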
waynenilsen wrote 1 day ago:
What kills me is I can't double click the thing to select it.
mrits wrote 1 day ago:
This is application specific. iTerm2 doesn't break selection on the
'-', while Firefox does.
dotancohen wrote 1 day ago:
From the fine article:
> Random values don't have natural sorting like integers or
lexicographic (dictionary) sorting like character strings. UUID v4s do
have "byte ordering," but this has no useful meaning for how they're
accessed.
Might the author mean that random values are not sequential, so
ordering them is inefficient? Of course random values can be ordered -
and ordering by what he calls "byte ordering" is exactly how all
integer ordering is done. And naive string ordering too, like we would
do in the days before Unicode.
andatki wrote 19 hours 31 min ago:
Hi there. Thanks for the feedback. I updated that section to
hopefully convey the intent more. The type of ordering we care about
for this topic is really B-Tree index traversal when inserting new
entries and finding existing entries (single and multiple values i.e.
an IN clause, updates, deletes etc). There's a compelling example I
re-created from Cybertec showing the pages needed and accessed for
equivalent user-facing results, comparing storing PKs as big integers
vs. UUID v4s, and how many more pages were needed for v4 UUIDs. I
found that to be helpful to support my real world experience as a
consultant on various "medium sized" Postgres databases (e.g. single
to 10s of millions of records) where clients were experiencing
excessive latency for queries, and the UUID v4 PK/FK selection made
for the reasons mentioned earlier was one of the main culprits. The
indexes wouldn't fit into memory, resulting in a lot of sequential
scans. I'd confirm this by showing an alternative schema design and
set of queries where everything was the same except integer PKs/FKs
were used. Smaller indexes (fit in memory), reliable index scans, less
latency, faster execution time.
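If you want to reproduce the effect yourself, a quick sketch (made-up
tables; exact sizes will vary by version and fillfactor):
  CREATE TABLE t_int  (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY, filler text);
  CREATE TABLE t_uuid (id uuid DEFAULT gen_random_uuid() PRIMARY KEY, filler text);
  INSERT INTO t_int  (filler) SELECT 'x' FROM generate_series(1, 1000000);
  INSERT INTO t_uuid (filler) SELECT 'x' FROM generate_series(1, 1000000);
  -- compare the two primary key indexes
  SELECT pg_size_pretty(pg_relation_size('t_int_pkey')) AS int_pk,
         pg_size_pretty(pg_relation_size('t_uuid_pkey')) AS uuid_pk;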
hashmush wrote 22 hours 58 min ago:
Agree, I did a double take on this too.
Values of the same type can be sorted if an order is defined on the
type.
It's also strange to contrast "random values" with "integers". You
can generate random integers, and they do have an ordering (depending
on what that means, though).
dev_l1x_be wrote 1 day ago:
Why would you need to order by UUID? I am missing something here.
Most of the time we use UUID keys for being able to create a new key
without coordination and most of the time we do not want to order by
primary key.
sagarm wrote 1 day ago:
Most common database indexes are ordered, so if you are using
UUIDv4 you will not only bloat the index, you will also have poor
locality. If you try to use composite keys to fix locality, you'll
end up with an even more bloated index.
hans_castorp wrote 1 day ago:
I have seen a lot of people sort by (generated) integer values to
return the rows "in creation order" assuming that sorting by an
integer is somehow magically faster than sorting by a proper
timestamp value (which gives a more robust "creation order" sorting
than a generated integer value).
sgarland wrote 1 day ago:
Assuming the integer value is the PK, it can in fact be much
faster for MySQL / MariaDB due to InnoDB's clustered index. If
it can do a range scan over the PK, and that's also the ORDER
BY (with matching direction), congratulations, the rows are
already ordered, no sort required. If it has to do a secondary
index lookup to find the rows, this is not guaranteed.
K0nserv wrote 1 day ago:
Isn't part of this that inserting into a btree index is more
performant when the keys are increasing rather than being random? A
random id will cause more re-balancing operations than always
inserting at the end. Increasing ids are also more cache friendly
sgarland wrote 1 day ago:
Yes, and for Postgres, it also causes WAL bloat due to the high
likelihood of full page writes.
dotancohen wrote 5 hours 8 min ago:
Could you expand on this? I use postgres often and though I could
have an LLM explain what you mean, I think I'd learn more to hear
it from you. Thank you.
crest wrote 1 day ago:
Any fixed sized bitstring has an obvious natural ordering, but since
they're allocated randomly they lack the density and locality of
sequential allocation.
torginus wrote 1 day ago:
To be polite, I don't think this article rests on sound technical
foundations.
mrinterweb wrote 17 hours 29 min ago:
How so?
dagss wrote 1 day ago:
The point is how closely located data you access often is. If data is
roughly sorted by creation time then data you access close to one
another in time is stored close to one another on disk. And typically
access to data is correlated with creation time. Not for all tables
but for many.
Accessing data in totally random locations can be a performance
issue.
Depends on lots of things ofc but this is the concern when people
talk about UUID for primary keys being an issue.
kreetx wrote 1 day ago:
Using a UUIDv4 as primary key is a trade-off: you use it when you
need to generate unique keys in a distributed manner. Yes, these are
not datetime ordered and yes, they take 128 bits of space. If you
can't live with this, then sure, you need to consider alternatives. I
wonder if "Avoid UUIDv4 Primary Keys" is a rule of thumb though.
torginus wrote 1 day ago:
I do not understand why 128 bits is considered too big - you
clearly can't have less, as on 64 bits the collision probability on
real world workloads is just too high, for all but the smallest
databases.
Auto-incrementing keys can work, but what happens when you run out
of integers? Also, distributed dbs probably make this hard, and
they can't generate a key on the client.
There must be something in Postgres that wants to store the records
in PK order, which, while it could be an okay default, I'm pretty
sure you can change, as it isn't great for write-heavy workloads.
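To put rough numbers on that, the birthday bound for n random d-bit
ids is
  p \approx \frac{n^2}{2^{d+1}}
so a billion random 64-bit ids already collide with probability around
3%, while the 122 random bits of a UUIDv4 keep that negligible at any
realistic n.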
p1necone wrote 20 hours 36 min ago:
I doubt many real world use cases would run out of incrementing
64 bit ids - collisions if they were random sure, but i64 max is
9,223,372,036,854,775,807 - if each row took only 1 bit of space,
that would be slightly more than an exabyte of data.
vbezhenar wrote 23 hours 19 min ago:
You won't run out of a 64-bit integer. IMO, a 64-bit integer (or
even less for some tables that aren't expected to grow much) is
the best approach for an internal database ID. If you want to expose
an ID, it might make sense to introduce a second UUID for selected
tables, if you want to hide the internal ID.
anarazel wrote 1 day ago:
The issue is more fundamental - if you have purely random keys,
there's basically no spatial locality for the index data. Which
means that for decent performance your entire index needs to be
in memory, rather than just recent data. And it means that you
have much bigger write amplification, since it's rare that the
same index page is modified multiple times close-enough in time
to avoid a second write.
dotancohen wrote 1 day ago:
If one needs timestamp ordering, then UUIDv7 is a good alternative.
But the author does not say timestamp ordering, he says ordering. I
think he actually means and believes that there is some problem
ordering UUIDv4.
kreetx wrote 1 day ago:
Yup. There are alternatives depending on what the situation is:
with non-distributed, you could just use a sufficiently sized int
(which can be rather small when the table is for e.g humans). You
could add a separate timestamp column if that is important.
But if you need UUID-based lookup, then you might as well have it
as a primary key, as that will save you an extra index on the
actual primary key. If you also need a date and the remaining
bits in UUIDv7 suffice for randomness, then that is a good option
too (though this does essentially amount to having a composite
column made up of datetime and randomness).
xandrius wrote 1 day ago:
To summarise the article: in PG, prefer using UUIDv7 over UUIDv4 as
they have slightly better performance.
If you're using the latest version of PG, there is a plugin for it.
That's it.
hans_castorp wrote 1 day ago:
With the latest Postgres version (>= 18) you do NOT need a plugin
sbuttgereit wrote 1 day ago:
You might have missed the big H2 section in the article:
"Recommendation: Stick with sequences, integers, and big integers"
After that then, yes, UUIDv7 over UUIDv4.
This article is a little older. PostgreSQL didn't have native
support so, yeah, you needed an extension. Today, PostgreSQL 18 is
released with UUIDv7 support... so the extension isn't necessary,
though the extension does make the claim:
"[!NOTE] As of Postgres 18, there is a built in uuidv7() function,
however it does not include all of the functionality below."
What those features are and if this extension adds more cruft in
PostgreSQL 18 than value, I can't tell. But I expect that the vast
majority of users just won't need it any more.
tmountain wrote 1 day ago:
Sticking with sequences and other integer types will cause problems
if you need to shard later.
SoftTalker wrote 22 hours 16 min ago:
This is mentioned, and in many applications you can safely say
you will never need to shard.
tmountain wrote 2 hours 21 min ago:
Yes, but if you do need to, it's much simpler if you've been using
UUIDs since the beginning. I'm personally not convinced that any
of the tradeoffs that come with a more traditional key are
worth the headache that could come in a scenario where you do
need to shard. I started a company last year, and the DB has
grown wildly beyond our expectations. I did not expect this,
and it continues to grow (good problem to have). It happens!
bigmadshoe wrote 22 hours 28 min ago:
I'm really no expert on sharding, but if you're using
increasing ints why can't you just shard on (id % n) or
something?
0x457 wrote 21 hours 8 min ago:
Because then you run into an issue when your 'n' changes. Plus,
where are you incrementing it? That requires a single
fault-tolerant ticker (some do that, btw).
Once you encode the shard number into the ID, you get:
- instantly* knowing which shard to query
- each shard having its own ticker
* programmatically, maybe visually as well depending on
implementation
I had IDs that encoded: entity type (IIRC 4 bits?), timestamp,
shard, sequence per shard. We even had an admin page where you
could paste an ID and it would decode it.
id % n is fine for a cache, because you can just throw the whole
thing away and repopulate, or when 'n' never changes, but it
usually does.
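A toy version of that layout, with hypothetical bit widths (41-bit
millisecond timestamp, 10-bit shard, 12-bit per-shard sequence):
  SELECT ((extract(epoch FROM now()) * 1000)::bigint << 22)
       | (5 << 12)   -- shard number
       | 42;         -- per-shard sequence counter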
tmountain wrote 2 hours 23 min ago:
^ This
sgarland wrote 1 day ago:
There are plenty of ways to deal with that. You can shard by some
other identifier (though I then question your table design), you
can assign ranges to each shard, etc.
zwnow wrote 1 day ago:
Especially in larger systems, how does one solve the issue of
reaching the max value of an integer in their database? Sure, for
an unsigned bigint that's hard to achieve, but regular ints? Apps
quickly outgrow those.
sbuttgereit wrote 1 day ago:
OK... but that concern seems a bit artificial. If bigints are
appropriate: use them. If the table won't get to bigint sizes:
don't. I've even used smallint for some tables I knew were
going to be very limited in size. But I wouldn't worry about
smallint's very limited number of values for those tables that
required a larger size for more records: I'd just use int or
bigint for those other tables as appropriate. The reality is
that, unless I'm doing something very specific where being
worried about the number of bytes will matter... I just use
bigint. Yes, I'm probably being wasteful, but in the cases
where those several extra bytes per record are going to really
add up.... I probably need bigint anyway and in cases where
bigint isn't going to matter the extra bytes are relatively
small in aggregate. The consistency of simply using one type
itself has value.
And for those using ints as keys... you'd be surprised how many
databases in the wild won't come close to consuming that many
IDs or are for workloads where that sort of volume isn't even
aspirational.
Now, to be fair, I'm usually in the UUID camp and am using
UUIDv7 in my current designs. I think the parent article makes
good points, but I'm after a different set of trade-offs where
UUIDs are worth their overhead. Your mileage and use-cases may
vary.
zwnow wrote 1 day ago:
Idk, I use whatever scales best, and that would be a close-to-
infinite scaling key. The performance compromise is probably
zeroed out once you have to adapt your database to a different
one supporting the current scale of the product. That's for
software that has to scale. Whole different story for stuff
that doesn't have to grow, obviously. I am in the UUID camp too,
but I don't care whether it's v4 or v7.
wongarsu wrote 1 day ago:
It's not like there are dozens of options and you
constantly have to switch. You just have to estimate if at
maximum growth your table will have 32 thousand, 2 billion
or 9 quintillion entries. And even if you go with 9
quintillion for all cases you still use half the space of a
UUID
UUIDv4 are great for when you add sharding, and UUIDs in
general prevent issues with mixing ids from different
tables. But if you reach the kind of scale where you have 2
billion of anything UUIDs are probably not the best choice
either
benterix wrote 1 day ago:
The article sums up some valid arguments against UUIDv4 as PKs but the
solution the author provides on how to obfuscate integers is probably
not something I'd use in production. UUIDv7 still seems like a
reasonable compromise for small-to-medium databases.
stickfigure wrote 22 hours 31 min ago:
In Postgres I often like to use a single sequence for everything. It
leaks some information yes but in a busy system it tends to be
"obscure enough".
x0x0 wrote 20 hours 53 min ago:
It's not the leaking that's the concern. It's that making the
names of objects hard to enumerate is a strongly
security-enhancing feature of a system.
Yes of course everyone should check and unit test that every object
is owned by the user or account loading it, but demanding more
sophistication from an attacker than taking "/my_things/23" and
loading "/my_things/24" is a big win.
stickfigure wrote 18 hours 2 min ago:
With a single sequence and a busy system, the ids for most
high-level tables/collections are extremely sparse. This doesn't
mean they can't be enumerated, but you will probably notice if
you suddenly start getting hammered with 404s or 410s or whatever
your system generates on "not found".
Also, if most of your endpoints require auth, this is not
typically a problem.
It really depends on your application. But yes, that's something
to be aware of. If you need some ids to be unguessable, make sure
they are not predictable :-)
thayne wrote 12 hours 25 min ago:
If you have a busy system, a single sequence is going to be a
pretty big performance bottleneck, since every resource
creation will need to acquire a lock on that sequence.
x0x0 wrote 16 hours 35 min ago:
> Also, if most of your endpoints require auth, this is not
typically a problem.
Many systems are not sparse, and separately, that's simply
wrong. Unguessable names are not a primary security measure
but a passive remediation for bugs or bad code. Broken access
control remains an OWASP top 10, and IDOR is a piece of that.
Companies still get popped for this.
See, eg, google having a bug in 2019, made significantly less
impactful by unguessable names
HTML [1]: https://infosecwriteups.com/google-did-an-oopsie-a-sim...
formerly_proven wrote 1 day ago:
If all you want is to obfuscate the fact that your social media site
only has 200 users and 80 posts, simply use a permutation over the
autoincrement primary key. E.g. IDEA or CAST-128, then encode in
base64. If someone steps on your toes because somewhere in your
codebase you're using a forbidden legacy cipher, just use AES-128.
(This is sort of the degenerate/tautological base case of
format-preserving encryption)
(What do you think Youtube video IDs are?)
conradfr wrote 1 day ago:
Can't you just change the starting value of your sequence?
enz wrote 1 day ago:
The problem with this approach is that you now have to manage a
secret key for (maybe) a very long time.
I shared this article a few weeks ago, discussing the problems with
this kind of approach: [1] I believe it can make sense in some
situations, but do you really want to implement such crypto-related
complexity?
HTML [1]: https://notnotp.com/notes/do-not-encrypt-ids/
formerly_proven wrote 23 hours 54 min ago:
The article is self-contradictory in that it acts like that key
is super-important ("Operations becomes a nightmare. You now have
a cryptographic secret to manage. Where does this key live?
Protected by a wrapping key living in a KMS or HSM? Do you use
the same key across prod, staging, and dev? If dev needs to test
with prod data, does it need access to prod encryption keys? What
about CI pipelines? Local developer machines?") but then also
acknowledges that we're talking about an obfuscation layer of
stuff which is not actually sensitive ("to hide timestamps that
aren't sensitive"). Don't get me wrong, it's a definitive
drawback for scaling the approach, but most applications have to
manage various secrets, most of which are actually important.
E.g. session signing keys, API keys, etc. It's still common for
applications to use signed sessions with RCE-prone data formats. The
language from that article, while not wrong, is much more apt for
those keys.
That being said, while fine for obfuscation, it should not be
used for security for this purpose, e.g. hidden/unlisted links,
confirmation links and so on. Those should use actual, long-ish
random keys for access, because the inability to enumerate them
is a security feature.
Retr0id wrote 1 day ago:
Why not use AES-128 by default? Your CPU has instructions to
accelerate AES-128.
benterix wrote 1 day ago:
I always thought they are used and stored as-is because the
kind of transformation you mention seems terribly expensive at
YT's scale, and I don't see a clear benefit in adding any kind
of obfuscation here.
pdimitar wrote 1 day ago:
> What do you think Youtube video IDs are?
I actually have no idea. What are they?
(Also what is the format of their `si=...` thing?)
intalentive wrote 23 hours 3 min ago:
Can't recall where I heard this, but I'm pretty sure the
si=... is tracking information that associates the link with the
user who shared it.
pdimitar wrote 22 hours 42 min ago:
Oh absolutely, I am just wondering _what_ does it contain.
mort96 wrote 1 day ago:
I tend to avoid UUIDv7 and use UUIDv4 because I don't want to leak
the creation times of everything.
Now this doesn't work if you actually have enough data that the
randomness of the UUIDv4 keys is a practical database performance
issue, but I think you really have to think long and hard about every
single use of identifiers in your application before concluding that
v7 is the solution. Maybe v7 works well for some things (e.g
identifiers for resources where creation times are visible to all
with access to the resource) but not others (such as users or orgs
which are publicly visible but without publicly visible creation
times).
barrkel wrote 1 day ago:
You shouldn't generally use PKs as public identifiers, least of all
UUIDs, which are pretty user hostile.
mort96 wrote 1 day ago:
I really don't see the issue with having a UUID in a URL.
throw0101a wrote 1 day ago:
> I tend to avoid UUIDv7 and use UUIDv4 because I don't want to
leak the creation times of everything.
See perhaps "UUIDv47 â UUIDv7-in / UUIDv4-out (SipHashâmasked
timestamp)":
* [1] * Sept 2025:
HTML [1]: https://github.com/stateless-me/uuidv47
HTML [2]: https://news.ycombinator.com/item?id=45275973
wongarsu wrote 1 day ago:
If that kind of stuff is on the table, you can also use boring
64-bit integer keys and encrypt those (e.g. [1]). Which in the end
is just a better-thought-out version of what the article author
did.
UUIDv47 might have a place if you need keys generated on multiple
backend servers without synchronization. But it feels very niche
to me.
1:
HTML [1]: https://wiki.postgresql.org/wiki/XTEA_(crypt_64_bits)
nbadg wrote 1 day ago:
I'm also not a huge fan of leaking server-side information; I
suspect UUIDv7 could still be used in statistical analysis of the
keyspace (in a similar fashion to the German tank problem for
integer IDs). Also, leaking data about user activity times (from
your other comment) is a *really* good point that I hadn't
considered.
I've read people suggest using a UUIDv7 as the primary key and a
UUIDv4 as a user-visible one as a remedy.
My first thought when reading the suggestion was, "well but you'll
still need an index on the v4 IDs, so what does this actually get
you?" But the answer is that it makes joins less expensive; you
only require the index once, when constructing the query from the
user-supplied data, and everything else operates with the
better-for-performance v7 IDs.
To be clear, in a practical sense, this is a bit of a
micro-optimization; as far as I understand it, this really only
helps you by improving the data locality of temporally-related
items. So, for example, if you had an "order items" table,
containing rows of a bunch of items in an order, it would speed up
retrieval times because you wouldn't need to do as many index
traversals to access all of the items in a particular order. But
on, say, a users table (where you're unlikely to be querying for
two different users who happen to have been created at
approximately the same time), it's not going to help you much. Of
course the exact same critique is applicable to integer IDs in
those situations.
Although, come to think of it, another advantage of a user-visible
v4 with a v7 PK is that you could use a different index type on the
v4 ID. Specifically, I would think that a hash index for the
user-visible v4 might be a halfway-decent way to go.
I'm still not sure either way if I like the idea, but it's
certainly not the craziest thing I've ever heard.
thayne wrote 12 hours 57 min ago:
I think a bigger benefit from doing that would be that inserts
would be cheaper. Instead of an expensive insert into the middle
of an index for every table that needs an index on that key, you
can do a cheaper insert at the end of the index for all of them
except for the one that uses uuid4.
But if you are doing that, why not just use an incrementing
integer instead of a uuidv7?
nbadg wrote 7 hours 34 min ago:
Certainly for many applications, the autoint approach would be
fine.
The benefit of uuid in this case is that it allows horizontally
scalable app servers to construct PKs on their own without risk
of collisions. In addition to just reducing database load by
doing the ID generation on the app server (admittedly usually a
minor benefit), this can be useful either to simplify insert
queries that span multiple tables with FK relationships
(potentially saving some round trips in the process) or in very
niche situations where you have circular dependencies in
non-nullable FKs (with the constraint deferred until the end of
the transaction).
cdmckay wrote 1 day ago:
Out of curiosity, why is it an issue if you leak creation time?
dboreham wrote 1 day ago:
Apart from all the other answers here: an external entity knowing
the relative creation time for two different accounts, or just
that the two accounts were created close in time to each other
can represent a meaningful information leak.
natch wrote 1 day ago:
If your system (pseudo-) random number generator (RNG) is
compromised to derive a portion of its entropy from things that
are knowable by knowing the time when the function ran, then the
search space for cracking keys created around the same time can
be shrunken considerably.
This doesnât even rely on your systemâs built-in RNG being
low quality. It could be audited and known to avoid such issues
but you could have a compromised compiler or OS that injects a
doctored RNG.
saaspirant wrote 1 day ago:
There was a HN comment about competitors tracking how many new
signups are happening and increasing the discounts/sales push
based on that. Something like this.
0x3f wrote 1 day ago:
That's happening everywhere. You can order industrial parts
from a Fortune 500 and check some of the numbers on it too, if
they're not careful about it.
JetSetIlly wrote 1 day ago:
In a business I once worked for, one of the users of the online
ordering system represented over 50% of the business' income,
something you wouldn't necessarily want them to know.
However, because the online ordering system assigned order
numbers sequentially, it would have been trivial for that
company to determine how important their business was.
For example, over the course of a month, they could order
something at the start of the month and something at the end of
the month. That would give them the total number of orders in
that period. They already know how many orders they have placed
during the month, so company_orders / total_orders =
percentage_of_business
It doesn't even have to be accurate, just an approximation. I
don't know if they figured out that they could do that but it
wouldn't surprise me if they had.
pezezin wrote 17 hours 27 min ago:
This is also something that depends heavily on regulations.
In my home country, invoice numbers have to be sequential by
law, although you can restart the numbering every year.
JetSetIlly wrote 7 hours 36 min ago:
Yes, even if it's not a legal requirement it's definitely
best practice to have sequential invoice numbers. I thought
about this at the time but these numbers aren't invoice
numbers, only order numbers.
hexbin010 wrote 13 hours 43 min ago:
A global sequence, or sequence-per-account? I suspect the
latter?
pezezin wrote 4 hours 28 min ago:
A sequence per "series", where a series can be a fiscal
year, a department or category, etc. But I am not sure if
you can have one series per customer, I only find
conflicting information.
You can have more details here, in the section "Complete
invoice": [1] [2] (Spanish only)
HTML [1]: https://sede.agenciatributaria.gob.es/Sede/en_gb...
HTML [2]: https://www.boe.es/buscar/act.php?id=BOE-A-2012-...
Bombthecat wrote 1 day ago:
Admins, early users, founders, CEOs, etc. would have the
lowest creation times...
bruce511 wrote 1 day ago:
The issue will be very context specific. In other words to
(reasonably) answer the question, we'd have to judge each
application individually.
For one example, say you were making voting-booth software. You
really don't want a (hidden) timestamp attached to each vote
(much less an incrementing id) because that would break voter
confidentiality.
More generally, it's more a underlying principle of data
management. Not leaking ancillary data is easier to justify than
"sure we leak the date and time of the record creation, but we
can't think of a reason why that matters."
Personally I think the biggest issue is "clever" programmers who
treat the uuid as data and start displaying the date and time.
This leads to complications ("that which is displayed, the
customer wants to change"). It's only a matter of time before
someone declares the date "wrong" and it must be "fixed". Not to
mention time zone or daylight savings conversions.
mort96 wrote 1 day ago:
Well you're leaking user data. I'm sure you can imagine
situations where "the defendant created an account on this site
on this date" could come up. And the user could have created that
account not knowing that the creation date is public, because
it's not listed anywhere in the publicly viewable part of the
profile other than the UUID in the URL.
koakuma-chan wrote 1 day ago:
Discord is doing fine.
mort96 wrote 1 day ago:
Hacker news is also doing fine, even though I can just click
your profile and see you joined in october 2024. It doesn't
matter for every use case.
But there are cases where it matters. Using UUIDv7 for
identifiers means you need to carefully consider the security
and privacy implications every time you create a new table
identified by a UUID, and you'll possibly end up with some
tables where you use v4 and some where you use v7. Worst
case, you'll end up with painful migrations from v7 to v4 as
security review identifies timestamped identifiers as a
security concern.
nish__ wrote 1 day ago:
Pretty much every social media app has a "Member since X"
visible on public profiles. I don't think it's an issue.
mort96 wrote 1 day ago:
Who said I was talking about social media?
nish__ wrote 1 day ago:
Well where else do users have public profiles?
0x3f wrote 1 day ago:
The whole point though is that the ID itself leaks info,
even if the profile is not public. There are many cases
where you reference an object as a foreign key, even if
you can't see the entire record of that foreign key.
nish__ wrote 22 hours 10 min ago:
I can't think of any.
strbean wrote 18 hours 59 min ago:
Sending a friend request is an obvious example.
kreetx wrote 1 day ago:
E.g, if your service users have timestamp as part of the key and
this data is visible to other users, you would know when that
account was created. This could be an issue.
robertlagrant wrote 1 day ago:
Depends on the data. If you use a primary key in data about a
person that shouldn't include their age (e.g. to remove age-based
discrimination) then you are leaking an imperfect proxy to their
age.
lwhi wrote 1 day ago:
So the UUID could be used as an imperfect indicator of a
record's creation time?
benterix wrote 1 day ago:
UUIDv7 but not UUIDv4.
lwhi wrote 1 day ago:
I suppose timing attacks become an issue too.
wongarsu wrote 1 day ago:
UUIDv7 still have a lot of random bits. Most attacks
around creating lots of ids are foiled by that
jwr wrote 1 day ago:
"if you use PostgreSQL"
(in the scientific reporting world this would be the perennial "in
mice")
kijin wrote 1 day ago:
The space requirement and index fragmentation issue is nearly the
same no matter what kind of relational database you use. Math is
math.
Just the other day I delivered significant performance gains to a
client by converting ~150 million UUIDv4 PKs to good old BIGINT. They
were using a fairly recent version of MariaDB.
splix wrote 1 day ago:
I think the author means all dbs that fit on a single server. Because
in distributed dbs you often want to spread the load evenly over
multiple servers.
esafak wrote 1 day ago:
To spell it out: it improves performance by avoiding hot spots.
zelphirkalt wrote 1 day ago:
If they can live with making keys only in one place, then sure,
this can work. If, however, they need something that is very highly
likely to be unique across machines, without the need to sync, then
using a big integer is no good.
If they can live with MariaDB, OK, but I wouldn't choose it in
the first place these days. Likely Postgres will also perform
better in most scenarios.
kijin wrote 1 day ago:
Yeah, they had relatively simple requirements so BIGINT was a
quick optimization. MariaDB can guarantee uniqueness of
auto-incrementing integers across a cluster of several servers,
but that's about the limit.
Had the requirements been different, UUIDv7 would have worked
well, too, because fragmentation is the biggest problem here.
orthoxerox wrote 1 day ago:
It's not just Postgres or even OLTP. For example, if you have an
Iceberg table with SCD2 records, you need to regularly locate and
update existing records. The more recent a record is, the more likely
it is to be updated.
If you use UUIDv7, you can partition your table by the key prefix.
Then the bulk of your data can be efficiently skipped when applying
updates.
andatki wrote 15 hours 4 min ago:
Good addition!
hyperpape wrote 1 day ago:
The thing is, none of us are mice, but many of us use Postgres.
It would be the equivalent of "if you're a middle-aged man" or
"you're an American".
P.S. I think some of the considerations may be true for any system
that uses B-Tree indexes, but several will be Postgres specific.