Should you let social media platforms use your data to train generative AI?

By Mary Warner posted 20 days ago

After LinkedIn added a privacy setting that opted everyone in to allowing their data to be used to train LinkedIn’s generative AI (GenAI), MSBA’s members had some questions. They wanted to know how LinkedIn might use that data, both in sharing it with other users and in altering the content of their feeds. They also wanted to know what to consider in deciding whether to opt in or out of this GenAI training.

There’s a lot to unpack in answering these questions, so let’s start with some background information.

LinkedIn is owned by Microsoft, which has invested billions of dollars' worth of cloud computing resources in OpenAI, the company behind ChatGPT, the product that brought generative AI into widespread use in November 2022.

There are many GenAI platforms now on the market, and they all have an insatiable need for fresh data. This data, which can include anything easily accessible online, is used to train the foundation models and large language models (LLMs) behind GenAI platforms. Because GenAI so quickly produces new content from the data it has trained on, it seems as though these platforms could simply use GenAI outputs to further train their models, but that doesn't work. After a few iterations of feeding GenAI content back into the models, the results are pretty much garbage, a degradation researchers call model collapse.

So, fresh data is absolutely necessary to increase accuracy and grow GenAI platforms.

For as vast as online content seems, it's not enough for GenAI. According to a quick Google search, LLMs may run out of data to train on between 2026 and 2032. At the earliest, that's less than 2 years away, so when you've got a data-rich environment like LinkedIn or any other social media platform, GenAI companies want to tap into it.

Because U.S. data privacy laws tend to favor social media corporations over the privacy rights of individuals, we end up with platforms like LinkedIn automatically opting users in to training their GenAI, with no say in the matter ahead of time. (Meta/Facebook is doing the same, as is X/Twitter.) GenAI companies used this same technique when they originally vacuumed up all the content they could get their hands on from the web, with no consideration of possible copyright claims. This is why we're seeing so many lawsuits from content creators against GenAI companies.

While GenAI companies are eager to use our private data, they are secretive about the specific content they are using, and how they are using it to train their models and generate outputs. 

When I first started experimenting with OpenAI’s ChatGPT and Microsoft Copilot, I tried to determine if they had used my personal blog for training by using some creative prompting. I asked them to respond to questions for which the answers were only on my blog, which is fairly quirky, so there's stuff there that would be difficult or impossible to find elsewhere on the web. Both replied with answers that indicated they had retrieved the info from my blog. When I asked ChatGPT for the source, it would not provide a link to my blog, but Microsoft Copilot did.

This is a fiddly way to figure out how a GenAI platform is using your data, and it shouldn’t be this hard.
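The probing approach described above can be sketched in a few lines of code: ask questions whose answers appear only on your own site, then check whether the model's replies reproduce those site-only facts. This is a hypothetical illustration, not a rigorous membership test; the questions, facts, and model replies below are all made up, and in a real session the replies would come from an actual chatbot, with fuzzier matching than a simple substring check.

```python
# Hypothetical sketch of the blog-probing idea described above.
# The probe questions, "unique facts," and model replies are invented
# stand-ins; in practice the replies would come from a live chatbot.

def probe_score(probes, model_answers):
    """Return the fraction of site-only facts the model reproduced.

    probes: dict mapping a question to a fact that appears only
            on your own site.
    model_answers: dict mapping the same questions to the
            chatbot's replies.
    """
    if not probes:
        return 0.0
    hits = 0
    for question, unique_fact in probes.items():
        reply = model_answers.get(question, "")
        # Naive substring match; real checking would need to be fuzzier.
        if unique_fact.lower() in reply.lower():
            hits += 1
    return hits / len(probes)

# Made-up example: two blog-only facts, one reproduced by the model.
probes = {
    "What did I name my 2019 sourdough starter?": "Bubbles McGee",
    "Which county fair did my quilt win at in 2021?": "Anoka",
}
model_answers = {
    "What did I name my 2019 sourdough starter?": "You named it Bubbles McGee.",
    "Which county fair did my quilt win at in 2021?": "I don't have that information.",
}
print(probe_score(probes, model_answers))  # → 0.5
```

A high score suggests, but doesn't prove, that the model saw your content during training; as the blog notes, it could also be retrieving your site live, which is why asking for the source link matters.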

When it comes to a platform like LinkedIn, where we share personal information, we ought to know exactly what information GenAI companies are using and how it might show up in responses. Is the data anonymized? What happens if that personal data escapes the LinkedIn ecosystem? Will Microsoft use it for more than training LinkedIn’s AI?

A member asked another good question: Will LinkedIn use our data to create even narrower silos of what we are presented in our timelines? I’d say that’s possible, but we can’t be sure.

Given these uncertainties, here are some things to consider when deciding whether to allow social media platforms to use your data to train GenAI.

Data privacy – Do you have any personal information on a social media platform that you don’t want to appear elsewhere? How much of the data you share on social media do you want to see used in ways you have no say in? Or are you an open book online and are fine with your info and content being used to train GenAI?

Intellectual property – Do you want to protect the copyright on any content you share, to ensure you get full credit or compensation for the work?

Altruism – Do you want to help train GenAI so that it becomes more accurate? Do you have special knowledge to share or a particular perspective that might not otherwise be represented in training AI?

Each person needs to make these determinations based on their own comfort level, but if you're not sure, the safest course of action is to turn off GenAI training until you've had more time to consider the ramifications.

***
