There's now an official definition of "open source AI", a term that companies like Facebook had been using (before this definition existed) to try to convince us (devs) that their models are in keeping with the values of Open Source.

But the reality for many of these companies and their models is that the content driving the models was stolen: taken without permission, regardless of whether that content is publicly available. It's like training on photos of artwork in a museum, music on the radio, or videos on TV, all of which are "publicly available".

The definition outlines that a published "open source" AI model should include:

Sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system.

...and the gut punch to companies co-opting "open source":

this must include: (1) the complete description of all data used for training, including (if used) of unshareable data, disclosing the provenance of the data, its scope and characteristics, how the data was obtained and selected

So it would be more fitting if companies like Facebook (with Llama etc.) used labels like "proprietary" or "probably trained on copyrighted content"...

Source: opensource.org