codevoid.de/1/hn/comments_47103649.gph

        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                             on Gopher (inofficial)
       
       
       COMMENT PAGE FOR:
  HTML   Cloudflare outage on February 20, 2026
       
       
        snowhale wrote 8 hours 7 min ago:
        the 'empty string = select all' pattern in the delete path is the kind
        of bug that static typing and explicit null handling would catch at
        compile time. when your delete API accepts a bare query string, you're
        one missed validation away from this. probably the deeper lesson is
        that destructive operations should require explicit confirmation of
        scope, not just 'no filter = everything.'
       
        est wrote 9 hours 4 min ago:
        bitbucket was down for a while as well. Seems no one noticed.
       
        abalone wrote 9 hours 10 min ago:
        The code they posted doesn't quite explain the root cause. This is a
        good study case for resilient API design and testing.
        
        They said their /v1/prefixes endpoint has this snippet:
        
          if v := req.URL.Query().Get("pending_delete"); v != "" {
              // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
              prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
              
              [..snip..]
          }
        
        What's implied but not shown here is that endpoint normally returns all
        prefixes. They modified it to return just those pending deletion when
        passing a pending_delete query string parameter.
        
        The immediate problem of course is this block will never execute if
        pending_delete has no value:
        
          /v1/prefixes?pending_delete    <-- doesn't execute block
        
        This is because Go defaults query params to empty strings and the if
        statement skips this case. Which makes you wonder, what is the value
        supposed to be? This is not explained. If it's supposed to be:
        
          /v1/prefixes?pending_delete=true   <--- executes block
        
        Then this would work, but the implementation fails to validate this
        value. From this you can infer that no unit test was written to
        exercise the value:
        
          /v1/prefixes?pending_delete=false   <-- wrongly executes block
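        
        A minimal Go sketch of the three cases above, and of one way to
        validate the value strictly. It only illustrates net/url behaviour;
        parsePendingDelete is a hypothetical helper, and treating the
        parameter as a boolean is an assumption, since the post never says
        what the value is supposed to be:

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strconv"
)

// parsePendingDelete is a hypothetical helper, not Cloudflare's code.
// It reports whether the pending-deletion filter was requested:
// an absent parameter means "no filter", while a present parameter must
// carry a value that strconv.ParseBool accepts.
func parsePendingDelete(rawQuery string) (bool, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false, err
	}
	// Has distinguishes "key absent" from "key present with empty value";
	// Get returns "" in both cases.
	if !q.Has("pending_delete") {
		return false, nil
	}
	v := q.Get("pending_delete")
	if v == "" {
		return false, errors.New("pending_delete needs an explicit true or false")
	}
	on, err := strconv.ParseBool(v)
	if err != nil {
		return false, fmt.Errorf("invalid pending_delete value %q", v)
	}
	return on, nil
}

func main() {
	queries := []string{
		"",
		"pending_delete",
		"pending_delete=",
		"pending_delete=true",
		"pending_delete=false",
		"pending_delete=potato",
	}
	for _, raw := range queries {
		on, err := parsePendingDelete(raw)
		fmt.Printf("%q -> filter=%v err=%v\n", raw, on, err)
	}
}
```

        With this shape, a bare "pending_delete" becomes an error the
        caller has to handle instead of a silent fall-through to the
        default branch.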
        
        The post explains "initial testing and code review focused on the BYOIP
        self-service API journey." We can reasonably guess their tests were
        passing some kind of "true" value for the param, either explicitly or
        using a client that defaulted param values. What they didn't test was
        how their new service actually called it.
        
        So, while there's plenty to criticize on the testing front, that's
        first and foremost a basic failure to clearly define an API contract
        and implement unit tests for it.
        
        But there's a third problem, in my view the biggest one, at the design
        level. For a critical delete path they chose to overload an existing
        endpoint that defaults to returning everything. This was a dangerous
        move. When high stakes data loss bugs are a potential outcome, it's
        worth considering a more restrictive API that is harder to use
        incorrectly. If they had implemented a dedicated endpoint for pending
        deletes they would have likely omitted this default behavior meant for
        non-destructive read paths.
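        
        A rough sketch of what such a dedicated endpoint could look like in
        Go. The route name, the store type and the in-memory data are all
        hypothetical; the point is only that the pending-deletion handler
        has no code path that can return the full list:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// store is a stand-in for the real database; every name here is hypothetical.
type store struct {
	all     []string
	pending []string
}

// newMux wires two separate read paths. The pending-deletion route can
// only ever return s.pending.
func newMux(s store) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/v1/prefixes", func(w http.ResponseWriter, r *http.Request) {
		json.NewEncoder(w).Encode(s.all)
	})
	mux.HandleFunc("/v1/prefixes/pending-deletion", func(w http.ResponseWriter, r *http.Request) {
		// Reject unexpected query parameters instead of silently ignoring them.
		if len(r.URL.Query()) != 0 {
			http.Error(w, "this endpoint takes no query parameters", http.StatusBadRequest)
			return
		}
		json.NewEncoder(w).Encode(s.pending)
	})
	return mux
}

// get performs an in-process request and returns the status code and body.
func get(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newMux(store{
		all:     []string{"192.0.2.0/24", "198.51.100.0/24"},
		pending: []string{"198.51.100.0/24"},
	})
	targets := []string{
		"/v1/prefixes",
		"/v1/prefixes/pending-deletion",
		"/v1/prefixes/pending-deletion?pending_delete",
	}
	for _, target := range targets {
		code, body := get(mux, target)
		fmt.Printf("%s -> %d %s", target, code, body)
	}
}
```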
        
        In my experience, these sorts of decisions can stem from team ownership
        differences. If you owned the prefixes service and were writing an
        automated agent that could blow away everything, you might write a
        dedicated endpoint for it. But if you submitted a request to a separate
        team to enhance their service to return a subset of X, without
        explaining the context or use case very much, they may be more inclined
        to modify the existing endpoint for getting X. The lack of context and
        communication can end up missing the risks involved.
        
        Final note: It's a little odd that the implementation uses Go's "if
        with short statement" syntax when v is only ever used once. This isn't
        wrong per se but it's strange and makes me wonder to what extent an LLM
        was involved.
       
          PunchyHamster wrote 9 hours 3 min ago:
          > But there's a third problem, in my view the biggest one, at the
          design level. For a critical delete path they chose to overload an
          existing endpoint that defaults to returning everything. This was a
          dangerous move. When high stakes data loss bugs are a potential
          outcome, it's worth considering more restrictive API that is harder
          to use incorrectly. If they had implemented a dedicated endpoint for
          pending deletes they would have likely omitted this default behavior
          meant for non-destructive read paths.
          
          Or POST endpoint, with client side just sending serialized object as
          query rather than relying that the developer remembers the magical
          query string.
       
        kgeist wrote 12 hours 32 min ago:
        It's something we debated in our team: if there's an API that returns
        data based on filters, what's the better behavior if no filters are
        provided -  return everything or return nothing?
        
        The consensus was that returning everything is rarely what's desired,
        for two reasons: first, if the system grows, allowing API users to
        return everything at once can be a problem both for our server (lots of
        data in RAM when fetching from the DB => OOM, and additional stress on
        the DB) and for the user (the same problem on their side). Second, it's
        easy to forget to specify filters,  especially in cases like "let's
        delete something based on some filters."
        
        So the standard practice now is to return nothing if no filters are
        provided, and we pay attention to it during code reviews. If the user
        does really want all the data, you can add pagination to your API. With
        pagination, it's very unlikely for the user to accidentally fetch
        everything because they must explicitly work with pagination tokens,
        etc.
        
        Another option, if you don't want pagination, is to have a separate
        method named accordingly, like ListAllObjects, without any filters.
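        
        A small Go sketch of that convention, under the assumption of an
        in-memory Repo with two illustrative filter fields (all names here
        are hypothetical). An empty filter yields nothing, and "everything"
        is only reachable through the explicitly named method:

```go
package main

import (
	"fmt"
	"strings"
)

type Object struct {
	Name  string
	Owner string
}

// Filter is hypothetical; a zero-value Filter means "no filters provided".
type Filter struct {
	NamePrefix string
	Owner      string
}

func (f Filter) isEmpty() bool { return f.NamePrefix == "" && f.Owner == "" }

type Repo struct{ objects []Object }

// ListObjects returns only objects matching every non-empty filter field.
// With no filters at all it returns nothing rather than everything.
func (r Repo) ListObjects(f Filter) []Object {
	if f.isEmpty() {
		return nil
	}
	var out []Object
	for _, o := range r.objects {
		if f.NamePrefix != "" && !strings.HasPrefix(o.Name, f.NamePrefix) {
			continue
		}
		if f.Owner != "" && o.Owner != f.Owner {
			continue
		}
		out = append(out, o)
	}
	return out
}

// ListAllObjects is the explicitly named way to ask for everything.
// It returns a copy so callers cannot mutate the repository's slice.
func (r Repo) ListAllObjects() []Object {
	return append([]Object(nil), r.objects...)
}

func main() {
	repo := Repo{objects: []Object{
		{Name: "alpha", Owner: "ann"},
		{Name: "beta", Owner: "bob"},
	}}
	fmt.Println("no filter:", len(repo.ListObjects(Filter{})))
	fmt.Println("owner=bob:", len(repo.ListObjects(Filter{Owner: "bob"})))
	fmt.Println("list all: ", len(repo.ListAllObjects()))
}
```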
       
          qwertyuiop_ wrote 29 min ago:
          how about returning an error? It's the generic "client sent
          something wrong" bucket. Missing a required filter param is
          unambiguously a client mistake according to your own docs/contract
          -> client error -> 4xx family -> 400 is the safest/default member
          of that family.
       
          est wrote 9 hours 2 min ago:
          > to have a separate method named accordingly, like ListAllObjects,
          without any filters
          
          For me it's like `filter1=*`
       
          PunchyHamster wrote 9 hours 8 min ago:
          But that query had a parameter. They just fucked up parsing it
       
          Philip-J-Fry wrote 10 hours 56 min ago:
          >allowing API users to return everything at once can be a problem
          both for our server (lots of data in RAM when fetching from the DB =>
          OOM, and additional stress on the DB)
          
          You can limit stress on RAM by streaming the data. You should ideally
          stream rows for any large dataset. Otherwise, like you say you are
          loading the entire thing into RAM.
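          
          As a sketch of the streaming idea in Go: encode each row as it
          is read instead of collecting a slice first. fakeCursor stands
          in for a real database cursor (with database/sql this would be
          a loop over rows.Next()), and the Row type and newline-delimited
          JSON output are assumptions made for the example:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

type Row struct {
	ID int `json:"id"`
}

// streamRows encodes each row as one JSON line as soon as next yields it,
// so only a single row is held in memory at a time.
func streamRows(w io.Writer, next func() (Row, bool)) (int, error) {
	enc := json.NewEncoder(w)
	written := 0
	for {
		row, ok := next()
		if !ok {
			return written, nil
		}
		if err := enc.Encode(row); err != nil {
			return written, err
		}
		written++
	}
}

// fakeCursor stands in for a database cursor that yields n rows.
func fakeCursor(n int) func() (Row, bool) {
	i := 0
	return func() (Row, bool) {
		if i >= n {
			return Row{}, false
		}
		i++
		return Row{ID: i}, true
	}
}

// render streams n rows into a string, for demonstration only.
func render(n int) string {
	var sb strings.Builder
	if _, err := streamRows(&sb, fakeCursor(n)); err != nil {
		return "error: " + err.Error()
	}
	return sb.String()
}

func main() {
	fmt.Print(render(3))
}
```

          Streaming bounds memory on the server, but it does not address
          the other concern raised above, which is a caller fetching
          everything by accident.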
       
          alemanek wrote 11 hours 49 min ago:
          Returning an empty result in that case may cause a more subtle
          failure.  I would think returning an error would be a bit better as
          it would clearly communicate that the caller called the API endpoint
          incorrectly.  If it's HTTP, a 400 Bad Request status code would seem
          appropriate.
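          
          For illustration, a hedged Go sketch of a handler that answers
          400 when a required filter is missing. The "owner" parameter and
          the handler name are made up for the example:

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// listObjects is a hypothetical handler whose contract requires an
// "owner" filter.
func listObjects(w http.ResponseWriter, r *http.Request) {
	owner := r.URL.Query().Get("owner")
	if owner == "" {
		// Covers both a missing parameter and "owner=" with no value.
		http.Error(w, "missing required query parameter: owner", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "objects owned by %s\n", owner)
}

// status runs the handler in-process and returns the HTTP status code.
func status(target string) int {
	rec := httptest.NewRecorder()
	listObjects(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

func main() {
	for _, target := range []string{"/objects", "/objects?owner=", "/objects?owner=ann"} {
		fmt.Println(target, status(target))
	}
}
```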
       
          MobileVet wrote 12 hours 23 min ago:
          I like your thought process around the "empty" case.  While the
          opposite of a filter is no filter, to your point, that is probably
          not really the desire when it comes to data retrieval.  We might have
          to revisit that ourselves.
       
        djfobbz wrote 12 hours 50 min ago:
        I'm honestly amazed that a company CF's size doesn't have a neat little
        cluster of Mac Minis running OpenClaw and quietly taking care of this
        for them.
       
        user205738 wrote 12 hours 50 min ago:
        They should have rewritten this code in Rust using these brilliant
        language models. /jk
       
        vimda wrote 12 hours 51 min ago:
        One has to wonder when the board realises Dane was a bad replacement
        for JGC. These outages are getting ridiculous
       
        wa008 wrote 13 hours 17 min ago:
        This transparent report can earn my trust
       
        dilyevsky wrote 13 hours 22 min ago:
        > Because the client is passing pending_delete with no value, the
        result of Query().Get("pending_delete") here will be an empty
        string (""), so the API server interprets this as a request for all
        BYOIP prefixes instead of just those prefixes that were supposed to be
        removed.
        
        Lmao, iirc long time ago Google's internal system had the same exact
        bug (treating empty as "all" in the delete call) that took down all
        their edges. Surprisingly there was little impact as traffic just
        routed through the next set of proxies.
       
        alansaber wrote 13 hours 30 min ago:
        Not sure why everyone is complaining, new MCP features are more
        important than uptime
       
        otar wrote 13 hours 33 min ago:
        Reliability was/is CF's label.
        
        It's alarming already. Too many outages in the past months. CF should
        fix it, or it becomes unacceptable and people will leave the platform.
        
        I really hope they will figure things out.
       
          tallytarik wrote 11 hours 0 min ago:
          We're still waiting on a solution for [1] (which actually started a
          month earlier than the incident reports)
          
          In the meantime, as you say, we're now going through and evaluating
          other vendors for each component that CF provides - which is both
          unfortunate, and a frustrating use of time, as CF's services
          "just worked" very well for a very long time.
          
  HTML    [1]: https://www.cloudflarestatus.com/incidents/391rky29892m
       
          argestes wrote 13 hours 28 min ago:
          I have many things dependent on Cloudflare. That makes me root for
          Cloudflare and I think I'm not the only one. Instead of finding
          better options we're getting stuck on an already failing HA solution.
          I wonder what caused this.
       
            slothsarecool wrote 12 hours 48 min ago:
            There are no alternatives, and those alternatives that did exist
            back in the day, had to shut down due to either going out of
            business or not being able to keep a paygo model.
            
            Not everybody needs cloudflare, but those that need it and aren't
            major enterprises, have no other option.
       
              Sanzig wrote 12 hours 40 min ago:
              Bunny.net? Doesn't have near the same feature set as Cloudflare,
              but the essentials are there and you can easily pay as you go
              with a credit card.
       
                slothsarecool wrote 12 hours 21 min ago:
                Their WAF isn't there yet, the moment it can build the
                expressions you can build with CF (and allows you to have as
                much visibility into the traffic as CF does), then it might be
                a solid option, assuming they have the compute/network
                capacity.
       
              pocksuppet wrote 12 hours 41 min ago:
              Lots of people who think they need Cloudflare don't. What are you
              using it for?
       
                slothsarecool wrote 12 hours 23 min ago:
                L7 DDoS protection and global routing + CDN, there is not a
                single paygo provider that can handle the capacity CF can,
                especially not at this price range (mitigated attacks
                distributed from approximately 50-90k ips, adding up to about
                300-700k rps).
                
                We tried Stackpath, Imperva (Incapsula back in the day), etc
                but they were either too expensive or went out of business.
       
                  blibble wrote 11 hours 7 min ago:
                  > especially not at this price range
                  
                  pay peanuts, get monkeys
       
            arcatech wrote 13 hours 1 min ago:
            Do you not feel concern about you and everybody else deciding to
            put ALL of their eggs into one basket like this?
       
              esseph wrote 7 hours 31 min ago:
              Like AWS/GCP/Azure?
       
              ranger_danger wrote 11 hours 22 min ago:
              I would bet money that most people who use CF now are already
              hosting their endpoints at a single provider. I don't think most
              people care until it actually becomes enough of a problem.
       
        NooneAtAll3 wrote 13 hours 34 min ago:
        again?
       
        tokyobreakfast wrote 13 hours 44 min ago:
        Is this trend of oversharing code snippets and TMI postmortems done
        purposely to distract their customers from raging over the outage and
        the next impending fuckup?
       
          samrus wrote 13 hours 6 min ago:
          Just seems like transparency. I agree that we should also judge them
          based on the frequency of these incidents and whether they provide
          a path to non-repeatability, but I wouldn't criticize them for the
          transparency per se
       
          bdangubic wrote 13 hours 38 min ago:
          and if they didn't we'd be posting about lack of transparency.
          damned if you do, damned if you don't
       
          alansaber wrote 13 hours 40 min ago:
          Well I still appreciate a good postmortem even if I have no doubt
          it'll happen again imminently
       
        anurag wrote 13 hours 47 min ago:
        The one redeeming feature of this failure is staged rollouts. As
        someone advertising routes through CF, we were quite happy to be spared
        from the initial 25%.
       
        jaboostin wrote 13 hours 47 min ago:
        Hindsight is 20/20 but why not dry run this change in production and
        monitor the logs/metrics before enabling it? Seems prudent for any new
        "delete something in prod" change.
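        
        One way a dry-run mode could look, sketched in Go. runCleanup, the
        dryRun flag and the del callback are hypothetical names, not
        anything from the post; in dry-run mode the task only logs what it
        would have removed:

```go
package main

import (
	"fmt"
	"log"
)

// runCleanup deletes each candidate via del, unless dryRun is set, in which
// case it only logs the intended deletions. It returns how many prefixes
// were actually deleted.
func runCleanup(candidates []string, dryRun bool, del func(prefix string) error) (int, error) {
	deleted := 0
	for _, p := range candidates {
		if dryRun {
			log.Printf("dry run: would delete %s", p)
			continue
		}
		if err := del(p); err != nil {
			return deleted, fmt.Errorf("deleting %s: %w", p, err)
		}
		deleted++
	}
	return deleted, nil
}

func main() {
	del := func(prefix string) error {
		log.Printf("deleted %s", prefix)
		return nil
	}
	candidates := []string{"192.0.2.0/24", "198.51.100.0/24"}

	// First pass in production: log only, then compare the log against
	// the expected set before switching dryRun off.
	n, err := runCleanup(candidates, true, del)
	fmt.Println("deleted:", n, "err:", err)
}
```

        A dry run like this would presumably have shown the task listing
        every prefix rather than the handful that were pending deletion.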
       
        VirusNewbie wrote 13 hours 54 min ago:
        If you track large SaaS and Cloud uptime, it seems to correlate pretty
        highly with compensation for big companies.  Is cloudflare getting top
        talent?
       
          bombcar wrote 13 hours 49 min ago:
          Based on IPO date and lockups, I suspect top talent is moving on.
       
        henning wrote 13 hours 55 min ago:
        Sure vibe-coded slop that has not been properly peer reviewed or tested
        prior to deployment is leading to major outages, but the point is they
        are producing lots of code. More code is good, that means you are a
        good programmer. Reading code would just slow things down.
       
          sp00chy wrote 13 hours 11 min ago:
          that's my feeling also. We will get this more and more in future.
       
        himata4113 wrote 14 hours 5 min ago:
        This blog post is inaccurate, the prefixes were being revoked over and
        over - to keep your prefixes advertised you had to have a script that
        would readd them or else it would be withdrawn again. The way they
        seemed to word it is really dishonest.
       
        ssiddharth wrote 14 hours 10 min ago:
        The eternal tech outage aphorism: It's always DNS, except for when it's
        BGP.
       
          subscribed wrote 11 hours 59 min ago:
          You could argue BGP is like DNS for IPs :)
       
        NinjaTrance wrote 14 hours 14 min ago:
        The irony is that the outage was caused by a change from the "Code
        Orange: Fail Small initiative".
        
        They definitely failed big this time.
       
        blibble wrote 14 hours 14 min ago:
        is this blog post LLM generated?
        
        the explanation makes no sense:
        
        > Because the client is passing pending_delete with no value, the
        result of Query().Get("pending_delete") here will be an empty
        string (""), so the API server interprets this as a request for all
        BYOIP prefixes instead of just those prefixes that were supposed to be
        removed. The system interpreted this as all returned prefixes being
        queued for deletion.
        
        client:
        
            resp, err := d.doRequest(ctx, http.MethodGet, `/v1/prefixes?pending_delete`, nil)
        
        server:
        
            if v := req.URL.Query().Get("pending_delete"); v != "" {
                // ignore other behavior and fetch pending objects from the ip_prefixes_deleted table
                prefixes, err := c.RO().IPPrefixes().FetchPrefixesPendingDeletion(ctx)
                if err != nil {
                    api.RenderError(ctx, w, ErrInternalError)
                    return
                }

                api.Render(ctx, w, http.StatusOK, renderIPPrefixAPIResponse(prefixes, nil))
                return
            }
        
        even if the client had passed a value it would have still done exactly
        the same thing, as the value of "v" (or anything from the request) is
        not used in that block
       
          PunchyHamster wrote 9 hours 1 min ago:
          better explanation here [1] but in short they are checking whether
          the string is empty, and the query string "pending_delete" is the
          same as "pending_delete=" and will return an empty string
          
          Or, if they specified `/v1/prefixes?pending_delete=potato` it would
          return the "correct" list of objects to delete
          
          Or in other words "Go has type safety, fuck it, let's use strings
          like in '90s PHP apps instead"
          
  HTML    [1]: https://news.ycombinator.com/item?id=47106852
       
          subscribed wrote 12 hours 2 min ago:
          That's weird. They only removed some 6 of our prefixes out of perhaps
          40 we have with them, so something seems off in this explanation.
       
          himata4113 wrote 14 hours 4 min ago:
          yep, no mention that re-advertised prefixes would be withdrawn again
          as well during the entire impact even after they shut it down.
       
          bretthoerner wrote 14 hours 8 min ago:
          > even if the client had passed a value it would have still done
          exactly the same thing, as the value of "v" (or anything from the
          request) is not used in that block
          
          If they passed in any value, they would have entered the block and
          returned early with the results of FetchPrefixesPendingDeletion.
          
          From the post:
          
          > this was implemented as part of a regularly running sub-task that
          checks for BYOIP prefixes that should be removed, and then removes
          them.
          
          They expected to drop into the block of code above, but since they
          didn't, they returned all routes.
       
            blibble wrote 13 hours 54 min ago:
            okay so the code which returned everything isn't there
            
            actual explanation: the API server by default returns everything.
            the client attempted to make a request to return "pending_deletes",
            but as the request was malformed, the API instead went down the
            default path, which returned everything. then the client deleted
            everything.
            
            makes sense now
            
            but that explanation is even worse
            
            because that means the code path was never tested?
       
              jbxntuehineoh wrote 13 hours 23 min ago:
              or they tested it, but not with a dataset that contained prefixes
              not pending deletion
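              
              A sketch of a client-side guard plus the kind of mixed
              fixture that would exercise it, in Go. It assumes each
              returned record carries a PendingDeletion flag, which is
              hypothetical; the post does not show the response shape:

```go
package main

import (
	"errors"
	"fmt"
)

// Prefix is a hypothetical response record.
type Prefix struct {
	CIDR            string
	PendingDeletion bool
}

var ErrNotPending = errors.New("response contains a prefix that is not pending deletion")

// selectForDeletion refuses the whole batch if any returned prefix is not
// marked as pending deletion, instead of trusting the server-side filter.
func selectForDeletion(returned []Prefix) ([]string, error) {
	out := make([]string, 0, len(returned))
	for _, p := range returned {
		if !p.PendingDeletion {
			return nil, fmt.Errorf("%w: %s", ErrNotPending, p.CIDR)
		}
		out = append(out, p.CIDR)
	}
	return out, nil
}

func main() {
	// A fixture that mixes live and pending prefixes: with only pending
	// prefixes in the dataset, a broken filter goes unnoticed.
	mixed := []Prefix{
		{CIDR: "192.0.2.0/24", PendingDeletion: false},
		{CIDR: "198.51.100.0/24", PendingDeletion: true},
	}
	cidrs, err := selectForDeletion(mixed)
	fmt.Println(cidrs, err)
}
```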
       
          bstsb wrote 14 hours 10 min ago:
          doesn't look AI-generated. even if they have made a mistake, it's
          probably just from the rush of getting a postmortem out prior to root
          cause analysis
       
        dryarzeg wrote 14 hours 24 min ago:
        DaaS - Downtime as a Service©
        
        Just joking, no offence :)
       
          logicchains wrote 12 hours 56 min ago:
          DaaS is good ja
       
        atty wrote 14 hours 30 min ago:
        I do not work in the space at all, but it seems like Cloudflare has
        been having more network disruptions lately than they used to. To
        anyone who deals with this sort of thing, is that just recency bias?
       
          slophater wrote 13 hours 39 min ago:
          been at cf for 7 yrs but thinking of gtfo soon. the ceo is a
          manchild, new cto is an idiot, rest of leadership was replaced by
          yes-men, and the push for AI-first is being a disaster. c levels
          pretend they care about reliability but pressure teams to constantly
          ship, cto vibe codes terraform changes without warning anyone, and
          it's overall a bigger and bigger mess
          
          even the blog, that used to be a respected source of technical
          content, has morphed into a garbage fire of slop and vaporware
          announcements since jgc left.
       
            nanankcornering wrote 4 hours 42 min ago:
            do you care that much about leadership when using a service? even I
            dont know gcp's c-level, aws's c-level, even vercel's c-level. only
            know rauchg.
            
            i think i care much more about our SLAs (if any)
       
            sebmellen wrote 11 hours 16 min ago:
            Do you feel that Matthew Prince is still technically
            active/informed? I've interacted with him in the past and he seemed
            relatively technically grounded, but that doesn't seem as true
            these days.
       
              3rodents wrote 9 hours 53 min ago:
               [1] [2] Rather than be driven by something rational like
              building a great product or making lots of money he is apparently
              driven by a desperate fear of being a dinosaur.
              
              Regardless of how competent he is or isn't as a technologist, a
              leader leading with fear is a recipe for disaster.
              
  HTML        [1]: https://xcancel.com/eastdakota/status/202521549514256417...
  HTML        [2]: https://xcancel.com/eastdakota/status/202522127006158045...
       
                NicoJuicy wrote 1 hour 15 min ago:
                Development has changed, I don't see much else he mentions.
                
                As a dev, you'll need to have that stuff in your knowledge
                toolbox
       
            __turbobrew__ wrote 12 hours 59 min ago:
            You know what they say, shit rolls downhill. I don't personally
            know the CEO, but the feeling I have got from their public fits on
            social media doesn't instill confidence.
            
            If I was a CF customer I would be migrating off now.
       
            goalieca wrote 13 hours 5 min ago:
            I've had a lot of problems lately. Basic things are failing and
            it's like product isn't involved at all in the dash. What's
            worse? The support.. the chat is the buggiest thing I've ever
            seen.
       
              slophater wrote 11 hours 35 min ago:
              don't worry, if it gets much worse the ceo will just throw all of
              support under the bus again. it will surely get better.
       
                goalieca wrote 9 hours 38 min ago:
                How about accurate billing info. The ux can't even figure out
                we're annually not monthly. Maybe the AI slop will continue
                to miscount resources and cost you revenue or piss off a
                customer when the dashboards they've been using don't match the
                invoice
       
            a24446ff87 wrote 13 hours 12 min ago:
            GSD! GSD!! ship! ship! ship!
            
            **everything breaks**
            
            ...
            
            **everything breaks again**
            
            oh fuck! Code Orange! I repeat, Code Orange! we need to rebuild
            trust(R)(TM)! we've let our customers down!
            
            ...
            
            **everything breaks again**
            
            Code Orangier! I repeat, Code Orangier!
       
              slophater wrote 11 hours 39 min ago:
              exactly. recently "if the cto is shipping more than you, you're
              doing something wrong"
              
              cto can't even articulate a sentence without passing it through
              an LLM, and instead of doing his job he's posting the stupidest
              shit to his personal bootlicking chat channel. I cringe every
              time at the brown-nosers that inhabit that hovel.
              
              no words for what the product org is becoming too. they should
              take their own advice a bit further and just replace all the
              leadership with an LLM, it would be cheaper and it's the same
              shit in practice
       
                ifwinterco wrote 1 hour 33 min ago:
                I have worked in some dysfunctional places but nothing like
                that, does sound bad.
                
                Just got to keep your head, remember it's just a job and you
                get paid regardless. Clock in, clock out, do the work assigned
                to you but mentally just check out while you look for a new
                role
       
            slophater wrote 13 hours 31 min ago:
            amazing how my comment was flagged in 30 seconds... keep
            bootlicking
       
          candiddevmike wrote 13 hours 59 min ago:
          Wait till you see the drama around their horrible terraform provider
          update/rewrite:
          
  HTML    [1]: https://github.com/cloudflare/terraform-provider-cloudflare/...
       
          dazc wrote 14 hours 4 min ago:
          Launching a new service every 5 minutes is obviously stretching their
          resources.
       
          lysace wrote 14 hours 8 min ago:
          It has been roughly speaking five and a half years since the IPO. The
          original CTO (John Graham-Cumming) left about a year ago.
       
            Cipater wrote 2 hours 47 min ago:
            He's still a member of the board though.
       
            dazc wrote 14 hours 2 min ago:
            I wondered what happened to him?
       
              jgrahamc wrote 12 hours 50 min ago:
              I am reading HN.
       
                SoKamil wrote 11 hours 58 min ago:
                What is your opinion on the recent Cloudflare outages?
       
              brcmthrowaway wrote 13 hours 56 min ago:
              He's on a yacht somewhere
       
                tedd4u wrote 13 hours 48 min ago:
                For real
       
            jacquesm wrote 14 hours 4 min ago:
            They coasted on momentum for half a year. I don't even think it
            says anything negative about the current CTO, but more of what an
            exception JGC is relative to what is normal. A CTO leaving would
            never show up the next day in the stats, the position is strategic
            after all. But you'd expect to see the effect after a while, 6
            months is longer than I would have expected, but short enough that
            cause and effect are undeniable.
            
            Even so, it is a strong reminder not to rely on any one vendor for
            critical stuff, in case that wasn't clear enough yet.
       
              lysace wrote 10 hours 19 min ago:
              You can coast for quite some time (5-10 years?) if you really
              lean into it (95% of the knowledge of maintaining and scaling the
              stack is there in the minds of hundreds of developers).
              
              Seems like Matthew Prince didn't choose that route.
       
                jacquesm wrote 8 hours 8 min ago:
                The problem is that CF operates in a highly dynamic environment
                and you can't really do that if the minds of those hundreds of
                developers relied for the major decision making on a key
                individual.
                
                This is the key individual paradox: they can be a massive asset
                and make the impossible happen but if and when they leave
                you've got a real problem unless you can find another
                individual that is just as capable. Now, I do trust JGC to have
                created an organization that is as mature as it could be, but
                at the same time it is next to impossible to quantify your own
                effect on the whole because you lack objectivity and your
                underlings may not always tell you the cold hard truth for
                reasons all their own.
                
                And in this case the problem is even larger: the experience
                collected by the previous guru does not transfer cleanly to the
                new one, simply because the new one lacks the experience of
                seeing the company go from being a tiny player to being a
                behemoth, and that's something you can do only once.
                
                I've always been of the opinion that without JGC Cloudflare did
                not stand a chance, irrespective of those hundreds of
                developers. And that's before we get into things like goodwill.
                
                And of those hundreds of developers you have to wonder how many
                see the writing on the wall and are thinking of jumping ship.
                The best ones always leave first.
                
                I would not be surprised at all if this whole saga ends with
                Google, Microsoft or Amazon absorbing CF at a fraction of its
                current value.
       
          Betelbuddy wrote 14 hours 14 min ago:
          Cloudflare outages are as predictable as the sun coming up tomorrow.
          It's their engineering culture.
          
  HTML    [1]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
       
          Icathian wrote 14 hours 26 min ago:
          It is not. They went about 5 years without one of these, and had a
          handful over the last 6 months. They're really going to need to
          figure out what's going wrong and clean up shop.
       
            Ylpertnodi wrote 13 hours 40 min ago:
            Typo: "shop", should have been with an 'el'.
            
            (: phonetically, because 'l's are hard to read.
       
            NinjaTrance wrote 14 hours 18 min ago:
            Engineers have been vibe coding a lot recently...
       
              dakiol wrote 13 hours 53 min ago:
              No joke. In my company we "sabotaged" the AI initiative led by
              the CTO. We used LLMs to deliver features as requested by the
              CTO, but we introduced a couple of bugs here and there
              (intentionally). As a result, the quarter ended up with more time
              allocated to fix bugs and tons of customer claims. The CTO is now
              undoing his initiative. We all now have a bit more time to keep
              our jobs.
       
                rixed wrote 2 hours 41 min ago:
                Sounds like what an LLM would post if it were tasked to
                advertise LLM coding abilities. Nice manipulation of human
                emotions, well played.
       
                renegade-otter wrote 13 hours 2 min ago:
                I see someone is not familiar with the joys of the current job
                market.
       
                hypeatei wrote 13 hours 11 min ago:
                That's extremely unethical. You're being paid to do something
                and you deliberately broke it which not only cost your employer
                additional time and money, but it also cost your customers time
                and money. If I were you, I'd probably just quit and find
                another profession.
       
                samrus wrote 13 hours 17 min ago:
                That's actively malicious. I understand not going out of your
                way to catch the LLMs' bugs so as to show the folly of the
                initiative, but actively sabotaging it is legitimately
                dangerous behavior. It's acting in bad faith. And I say this as
                someone who would mostly oppose such an initiative myself.
                
                I would go so far as to say that you shouldn't be employed in
                the industry. Malicious actors like you will contribute to an
                erosion of trust that'll make everything worse.
       
                  tovej wrote 11 hours 45 min ago:
                  Forcing developers to use unsafe LLM tools is also malicious.
                  This is completely ethical to me. Not commenting on legality.
       
                    samrus wrote 9 hours 5 min ago:
                    I don't like it either, but it's not malicious. The LLM isn't
                    accessing your homeserver, it's accessing corporate
                    information. Your employer can order you to be reckless
                    with their information; that's not malicious, it's not your
                    information. You should CYA and not do anything illegal
                    even if you're asked. But using LLMs isn't illegal. This is a
                    bad-faith argument.
       
                  sp00chy wrote 13 hours 6 min ago:
                  Might be but sometimes you don't have another choice when
                  employers are enforcing AIs which have no "feeling" for
                  context of all business processes involved created by human
                  workers in the years before. Those who spent a lot of love
                  and energy for them mostly. And who are now forced to work
                  against an inferior but overpowered workforce.
                  
                  Don't stop sabotaging AI efforts.
       
                    samrus wrote 9 hours 3 min ago:
                    Honestly I kinda like the aesthetic of cyberanarchism, but
                    it's not for me. It erodes trust.
       
                logicchains wrote 13 hours 37 min ago:
                That's not "sabotaged", that's sabotaged, if you intentionally
                introduced the bugs. Be very careful admitting something like
                that publicly unless you're absolutely completely sure nobody
                could map your HN username to your real identity.
       
              jsheard wrote 14 hours 10 min ago:
              The featured blog post where one of their senior engineering PMs
              presented an allegedly "production grade" Matrix implementation,
              in which authentication was stubbed out as a TODO, says it all
              really. I'm glad a quarter of the internet is in such responsible
              hands.
       
                ranger_danger wrote 11 hours 20 min ago:
                Matrix doesn't actually define how one should do authentication
                though... every homeserver software is free to implement it
                however they want.
       
                  Arathorn wrote 9 hours 50 min ago:
                  the main bit of auth which was left unimplemented on
                  matrix-workers was the critical logic which authorizes
                  traffic over federation: [1]
                  
                  Auth for clients is also specified in the spec - there is
                  some scope for homeservers to freestyle, but nowadays they
                  have to implement OIDC: [2]
                  
  HTML            [1]: https://spec.matrix.org/latest/server-server-api/#au...
  HTML            [2]: https://spec.matrix.org/latest/client-server-api/#cl...
       
                gtowey wrote 13 hours 4 min ago:
                It's spreading and only going to get worse.
                
                Management thinks AI tools should make everyone 10x as
                productive, so they're all trying to run lean teams and load up
                the remaining engineers with all the work. This will end about
                as well as the great offshoring of the early 2000s.
       
                blibble wrote 13 hours 41 min ago:
                there was also a post here where an engineer was parading
                around a vibe-coded oauth library he'd made as a demonstration
                of how great LLMs were
                
                at which point the CVEs started to fly in
       
                dana321 wrote 14 hours 6 min ago:
                That's a classic Claude move, even the new Sonnet 4.6 still does
                this.
       
                  allovertheworld wrote 3 hours 36 min ago:
                  Wait till you get AI to write unit tests and tell it the test
                  must pass. After a few rounds it will make the test
                  "assert(true)" when the code can't get the test to pass.
       
                  bonesss wrote 13 hours 58 min ago:
                  It's almost as classic as just short circuiting tests in
                  lightly obfuscated ways.
                  
                  I could be quite the kernel developer if making the test
                  green was the only criterion.
       
        CommonGuy wrote 14 hours 31 min ago:
        Insufficient mock data in the staging environment? Like no BYOIP
        prefixes at all? Since even one prefix should have shown that it would
        be deleted by that subtask...
        
        From all the recent outages, it sounds like Cloudflare is barely tested
        at all. Maybe they have lots of unit tests etc, but they do not seem to
        test their whole system... I get that their whole setup is vast, but
        even testing that subtask manually would have surfaced the bug
       
          zmj wrote 11 hours 36 min ago:
          Testing the "whole system" for a mature enterprise product is quite
          difficult. The combinatorial explosion of account configurations and
          feature usage becomes intractable on two levels: engineers can't
          anticipate every scenario they need their tests to cover (because the
          product is too big to understand the whole of), and even if
          comprehensive testing was possible - it would be impractical on some
          combination of time, flakiness, and cost.
       
          suhputt wrote 11 hours 44 min ago:
          my guess is the company is rotting from the inside and drowning in
          tech debt
       
          martinald wrote 13 hours 23 min ago:
          Just crazy. Why does a staging environment matter? They should be
          running some integration tests against eg an in memory database for
          these kinds of tasks surely?
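          
          A minimal sketch of what such an integration test could look like,
          in Go, using an in-memory fake in place of a real database. Every
          name here (Prefix, MemStore, cleanup) is hypothetical and made up
          for illustration; none of it is Cloudflare's actual code. It seeds
          one active prefix and one prefix pending deletion, runs the cleanup,
          and checks that only the pending one is removed.

```go
package main

import "fmt"

// Prefix is a hypothetical stand-in for a BYOIP prefix record.
type Prefix struct {
	CIDR          string
	PendingDelete bool
}

// MemStore is an in-memory fake of the prefix database (an assumption,
// not a real Cloudflare component).
type MemStore struct {
	prefixes map[string]Prefix
}

// NewMemStore seeds the fake store with the given prefixes.
func NewMemStore(ps ...Prefix) *MemStore {
	s := &MemStore{prefixes: make(map[string]Prefix)}
	for _, p := range ps {
		s.prefixes[p.CIDR] = p
	}
	return s
}

// List returns every prefix, or only those pending deletion when
// onlyPending is true.
func (s *MemStore) List(onlyPending bool) []Prefix {
	var out []Prefix
	for _, p := range s.prefixes {
		if !onlyPending || p.PendingDelete {
			out = append(out, p)
		}
	}
	return out
}

// Delete removes a prefix by CIDR.
func (s *MemStore) Delete(cidr string) { delete(s.prefixes, cidr) }

// Has reports whether a prefix is still stored.
func (s *MemStore) Has(cidr string) bool {
	_, ok := s.prefixes[cidr]
	return ok
}

// cleanup deletes whatever the listing returns and reports how many
// prefixes were removed. If the listing ever returned everything, this
// would wipe the store, which is what the test is meant to catch.
func cleanup(s *MemStore, onlyPending bool) int {
	n := 0
	for _, p := range s.List(onlyPending) {
		s.Delete(p.CIDR)
		n++
	}
	return n
}

func main() {
	s := NewMemStore(
		Prefix{CIDR: "192.0.2.0/24"},
		Prefix{CIDR: "198.51.100.0/24", PendingDelete: true},
	)
	removed := cleanup(s, true)
	fmt.Println("removed:", removed)
	fmt.Println("active prefix kept:", s.Has("192.0.2.0/24"))
	fmt.Println("pending prefix kept:", s.Has("198.51.100.0/24"))
}
```

          The point of seeding at least one active prefix is the one made
          upthread: with an empty store, a cleanup that deletes everything
          and a cleanup that deletes only pending prefixes look identical.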
       
          dabinat wrote 14 hours 16 min ago:
          I think Cloudflare does not sufficiently test lesser-used options. I
          lurk in the R2 Discord and a lot of users seem to have problems with
          custom domains.
       
          asciii wrote 14 hours 22 min ago:
          It was also merged 15 days prior to production release...however,
          you're spot on with the empty test. That's a basic scenario that if
          it returned all...is like oh no.
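          
          For what it's worth, the empty case is easy to reproduce with Go's
          standard net/url package. The sketch below compares the value-based
          check quoted elsewhere in the thread (Get(...) != "") with a
          presence-based check (Has). The helper names are hypothetical and
          only the pending_delete parameter name comes from the quoted
          snippet; this is an illustration, not the actual handler.

```go
package main

import (
	"fmt"
	"net/url"
)

// pendingDeleteByValue mirrors the value-based check quoted in the
// thread: the flag counts as set only if its value is non-empty.
func pendingDeleteByValue(rawQuery string) (bool, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false, err
	}
	return q.Get("pending_delete") != "", nil
}

// pendingDeletePresent treats the flag as set whenever the key is
// present, even with no value. url.Values.Has needs Go 1.17 or later.
func pendingDeletePresent(rawQuery string) (bool, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return false, err
	}
	return q.Has("pending_delete"), nil
}

func main() {
	for _, raw := range []string{"", "pending_delete", "pending_delete=true"} {
		byValue, _ := pendingDeleteByValue(raw)
		present, _ := pendingDeletePresent(raw)
		fmt.Printf("%q: value check=%v, presence check=%v\n", raw, byValue, present)
	}
}
```

          A bare "pending_delete" with no value passes the presence check
          but fails the value check, so a caller sending the flag that way
          silently gets the unfiltered behavior.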
       
        boarush wrote 14 hours 31 min ago:
        While neither I nor the company I work for is directly impacted by this
        outage, I wonder how long Cloudflare can take these hits and keep
        apologizing for them. I truly appreciate them being transparent about
        it, but businesses care more about SLAs and uptime than the incident
        report.
       
          jacquesm wrote 14 hours 1 min ago:
          Bluntly: they expended that credit a while ago. Those that can will
          move on. Those that can't have a real problem.
          
          As for your last sentence:
          
          Businesses really do care about the incident reports because they
          give good insight into whether they can trust the company going
          forward. Full transparency and a clear path to non-repetition due to
          process or software changes are called for. You be the judge of
          whether or not you think that standard has been met.
       
            boarush wrote 13 hours 55 min ago:
            I might be looking at it differently, but aren't decisions about
            which service provider to use made by management? Incident
            reports don't ever reach there in my experience.
       
              jacquesm wrote 10 hours 28 min ago:
              Every company that relies on their suppliers and that has mature
              management maintains internal supplier score cards as part of
              their risk assessment, more so for suppliers that are hard to
              find replacements for. They will of course all have their own
              thresholds for action but what has happened in the last period
              with CF exceeds most of the thresholds for management comfort
              that I'm aware of.
              
              Incident reports themselves are highly technical, so will not
              reach management because they are most likely simply not equipped
              to deal with them. But the CTOs of the companies will take
              notice, especially when their own committed SLAs are endangered
              and their own management asks them for an explanation. CF makes
              them all look bad right now.
       
              samrus wrote 13 hours 9 min ago:
              In my experience, the gist of it does reach management when it's
              an existing vendor. Especially if management is tech literate.
              
              Because management wants to know why the graphs all went to zero,
              and the engineers have nothing else to do but relay the incident
              report.
              
              This builds a perception for management of the vendor, and if the
              perception is that the vendor doesn't tell them shit or doesn't
              even seem to know there's an outage, then management can decide to
              shift vendors.
       
          llama052 wrote 14 hours 17 min ago:
          I'll take clarity and actual RCAs over Microsoft's approach of
          not notifying customers and keeping their status page green until
          enough people notice.
          
          One thing I do appreciate about Cloudflare is their actual use of
          their status page. That's not to say these outages are okay. They
          aren't. However, I'm pretty confident in saying that a lot of
          providers would have a big paper trail of outages if they were
          honest to the same degree as Cloudflare, or more so. At least from
          what I've noticed, especially this year.
       
            boarush wrote 14 hours 12 min ago:
            Azure straight up refuses to show me if there's even an incident
            even if I can literally not access shit.
            
            But the last few months have been quite rough for Cloudflare, with
            a few outages on their Workers platform that didn't quite make the
            headlines too. Can't wait for Code Orange to get to production.
       
       
   DIR <- back to front page