LLMs - I won't be fooled again | Nortrup in Development

30 Apr 2023

I’ve stayed on the sidelines of much of the LLM debate, but I’m willing to stake out my skepticism on this “world changing” technology. As I think back over the arc of my online life, I see these foundational models as the latest in a series of events where amazing technology was going to change the world. But each time, individual value is diluted for corporate money grabs.

Going way back in time, my first real formative experience on the internet was GeoCities. When a friend showed me that I could write an HTML file and make it show up on the internet, I was hooked. It was magic to me that I could make something the whole world could see. I quickly fell down the rabbit hole of how to make these sites.

My site was entirely unimpressive and probably had terrible design. Sadly, or maybe for the better, it doesn’t exist anymore. I made my first money on the internet with that site. I hosted ads for a career site and managed to make something like $87. The “neighborhoods” that GeoCities fostered around topics were real; I joined mine, became a “Community Leader” and learned some more about building websites. Eventually I taught some folks how to write HTML. It was great.

Then Yahoo decided to buy GeoCities. With the purchase came terms of service that appeared to give Yahoo rights over user content. They later revised their terms of service, but the damage was done. People fled the site, the neighborhood structure was dissolved and communities disbanded. Users were prohibited from hosting their own ads.

Rather famously, the acquisition didn’t work out and GeoCities was quietly closed. This was my first brush with my content and presence online being a chip in corporate money-making plans. Then there was the dot-com crash in the early 2000s. GeoCities was probably both born of that first tech exuberance and an early casualty of it.

Next came social media. First was MySpace (which I didn’t really use), but more famously came Facebook. When I landed at RIT during my freshman year, it seemed like a great way to get to know folks and build communities. It was contained: you could only use it to see other people at your school, and you could find people in your classes, dorms and clubs.

Over time, you got to keep your account as you left school, then you could talk to people at other schools. It was genuinely useful as I left college and joined the Army. I could stay in touch with friends from high school and college while I was a world away in Iraq for 14 long months. Facebook was an invaluable tool to stay connected to home, and friends. To stay sane a world away in a combat zone.

Yes, there were some challenges with the site, and I probably should have been more concerned about privacy earlier. I certainly knew they were selling me ads, but that seemed like a fair trade for the utility.

Over time, as we all put in more and more data, that contract wore thin. The breaking point for me was when they seemed to allow my data to be used to break democracy and, quite honestly, my country. I know that in retrospect the Cambridge Analytica data probably didn’t have a serious effect on the election, but the contract had changed. It was no longer an even trade with Facebook. My relationships and conversations would be leveraged to sell more ads at higher prices in more markets.

I was no longer a customer, I was the inventory.

Next came crypto. What a freaking mess that is. Literal world burning shitshow. Maybe you can tell this one makes me angry.

Growing out of an understandable distrust of the financial system after the 2008 financial crisis, crypto bros (yes, mostly bros) promised that they would forge a new social contract where users could put their information beyond the reach of censorship and make money too!

Unfortunately, it turned out that a huge amount of crypto is outright scams, and the parts that weren’t scams were used to facilitate terrorism, cybercrime, extortion and murder for hire, and to finance North Korea’s nuclear weapons and ICBM program. All while sucking up energy in a way that made it materially harder to prevent the worst effects of climate change.

In the meantime, regular people were conned into putting their life savings into a system that speedran 100 years of financial crimes in the span of a decade. If you don’t believe me, read Web3 is Going Just Great.

And now we’ve arrived at Large Language Models or Foundational Models if we want to also look at the image generating versions.

Once again we are promised that this new technology will bring about a revolution in personal expression and business efficiency. These models will write copy for us, write code for us and give lost kittens homes. We are once again promised Care Bears and cupcakes if we just embrace this technology.

But I don’t plan to be fooled again. The premise of these models is that you will need to feed all of your data to a corporation that will use it to build bigger and more elaborate models to produce more and more convincing bullshit. All while continuing to spend extraordinary amounts of compute power.

The thing I find most frustrating is that they are intent on sucking up the whole of the internet. Even if you have walked the IndieWeb path and hosted your own bespoke handcrafted HTML website for years, a foundational model will crawl your data and regurgitate your content. No questions, no compensation, just your content for corporate gain.

It’s even worse if you have been taking part in social communities hosted by companies. Then you will be cashed in twice. Your host will sell your data to an LLM developer, which will then sell its model to a different company as a product.

Reddit has changed their API terms to make it easier to sell your community’s content to an LLM.
GitHub repurposed all the open source content they had on their site to train Copilot, which they will sell to companies and individuals to allow them to code faster based on all of your hard work.
Stack Overflow has decided to take all the questions and answers on their site, train LLMs on them, and then sell the result to their corporate customers. So decades of community contributions given under a Creative Commons license will now benefit SO and their corporate customers.
As reporting from the Washington Post helped highlight, the LLM data sets also include anything they can hoover up off of the internet. So right now the contents of this page could be consumed into these models, helping them write blog posts in the style of Andy Nortrup.

Maybe I’m now old enough to be a curmudgeon about all of these new toys and the way things used to be. But my experience in consumer tech over the last three-ish decades has been that each of these revolutions finds ways to privatize the commons to enrich the folks who could grab the data.

Once again, I feel like my data, my writing, my questions and curiosity are being turned into inventory for OpenAI / Microsoft, Google and Facebook to write a bot that sounds like I should trust it.

The outcomes are already obvious and pointed out by folks far smarter than me. Regulators in Europe and in the United States are starting to respond.

How are the builders of these models scrubbing private data out of the training data?
Do EU citizens have a right to have their data expunged from the training data as allowed by the GDPR?
Who is at fault when a model slanders or libels someone in the output?
How is this going to supercharge misinformation, spam, and phishing? I can’t wait for a deepfaked video call to my grandmother, using a script based on all of my writing, to bilk her out of her life savings and funnel cryptocurrency to some neo-fascist politician.

I’m just tired of being led to the slaughter once again.