Apache license is irrevocable.

#5
by xzuyn - opened

[Image attachment: irrevocable.png]

The model was already released under that license. I don't think you can just change it to a different license without the model itself being any different, so I think Apache-2.0 (commercial use permitted) still applies to it.

At the very least, users who downloaded it when you listed it with that license are allowed to use it for commercial use.

I'm not a lawyer.

Mosaic ML, Inc. org

I'm not a lawyer either. Consult your attorney and decide accordingly :)

jfrankle changed discussion status to closed
deleted

README has both licenses listed. Might want to fix that up.

License: Apache-2.0 (commercial use permitted)
This model was trained by MosaicML

Mosaic ML, Inc. org

Thank you! It's now fixed everywhere.

Might I suggest making an announcement about this in a more obvious capacity? If folks downloaded it while the license was Apache 2.0 you'll likely want to make it very obvious that the license has changed if they downloaded it beforehand, since this is a very major change that affects how users can make use of the model. Right now, if the commit message is the only indication that the license has changed, it's not obvious at all (I certainly wouldn't have noticed had I not stumbled across this thread!). I'd suggest either sticking an announcement at the very top of the main page that the model's license has been changed, or making an official blog post about it, or some combination thereof so that it's extremely clear.

@xzuyn @Monero A license can only grant rights that MosaicML actually holds. They changed the terms to reflect the fact that they don't have the right to license the model, because it was fine-tuned under fair-use assumptions.

As far as I'm concerned, the Apache license was never valid because they don't have those rights; it was void from the start, not revoked.

I'm also not a lawyer, but if you assume you have rights you don't and can't have, you're still liable for that assumption.

@jfrankle I agree a statement would be nice. I'm among the people that raised the issue on Twitter and would like to post a follow-up and acknowledge this; it's the right move and an important one to show respect for creators/authors/writers.

deleted

I'm not sure I understand what legal argument is being put forward here relative to Fair Use vis-à-vis the model.

Is this in reference to the books3 dataset? That dataset is licensed MIT and could maybe be argued as a Fair Use violation if the authors were so inclined to be upset by it.

As yet, there is no case law or precedent I'm aware of to suggest that matrices of numbers, generated from the relative relationships between word orderings in a given text, constitute a protectable copy of that text. Maybe this will be a concern when the SD lawsuits shake out, though it's going to be a very uphill argument to make that styles or manners of phrasing English should be copyrightable expressions, considering how aggressively transformative the process of making the models is to begin with. I mean, I'd argue the model is no more an offending technology than a computer hard drive and a printer. If anything, it's a far less effective one. But that would also just be my personal argument and not case law or legal precedent.

As an author and professional writer myself, I'd appreciate if they reverted it to Apache 2.0 and immediately started making money off of it with my personal blessing as a self-appointed representative of the collective will of all authors, living and dead, and with all of the rights and allowances afforded to me by such a proclamation, which I'm sure are multitudinous. It would show the greatest respect for me personally, and thus to all writers living and dead, and I think we can all agree that's the most important thing.

I decline to disclaim my status as a lawyer one way or another for sake of valuing my status as a Man of Mystery, Occasionally International.

@gozfarb You need (i) copyright licenses or (ii) copyright exceptions (e.g. Fair Use) for each of the following, and each can be assessed separately:

  • downloading and processing the works
  • memorizing the content of the works
  • producing output similar to the works

The larger the models and the larger the datasets, the worse it gets for each of those points. In particular the first two are very problematic with LLMs and web-scale datasets. What used to be borderline acceptable is now in "legally risky" territory to say the least.

I believe the legal status is very clear everywhere except the U.S. (maybe intentionally, to benefit those with deep pockets). But since most companies want to operate abroad in countries that also signed the Berne Convention, that's a good baseline. Under it, many clauses make clear what's no longer acceptable as "Fair Use": for instance, harming the economic interests of authors, writers, and creators, or causing prejudice to their moral rights, is simply not allowed and is thus considered infringing.

I agree it needs to be formalized! I'm surprised HuggingFace hasn't done anything to better understand and handle this, as ultimately they can be liable for facilitating the dissemination of content that infringes on others' rights.

Again, I can't speak to the specifics of international law on the matter, but please do provide links for the three standards of Fair Use that you're referencing here, because those are absolutely not the standard for Fair Use in the US (I won't rehash the four factors here). I'd be interested to read up on the relevant EU and UK laws on the matter; I was not aware of firm Fair Use arguments being widely used outside the US in that specific terminology, or with such nebulous legal tests as "memorizing."

Secondarily, I'd disagree that the Berne Convention (more like guidelines, really) is a good baseline. It is woefully out of date (1971 was the last revision) and essentially unupdatable in practical terms. While these circumstances do need good case law, Berne is probably not the door anyone concerned about the creeping scope of copyright wants to knock on. Likewise, as a deep skeptic of rigid international laws being enforced in individual sovereign nations, I don't like projects like the Berne Convention, especially because it precludes things like individual treaties or exceptions, assuming anyone wants to honor it meaningfully anyway.

As to Fair Use, it is fairly formalized in America. Moreover, there's fairly relevant case law on book processing and digitization in particular (https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,_Inc.) that maps at least in part onto the works and uses in question here. Obviously, everyone wants to avoid the lawsuit in the first place, and I stand by my claim that this particular subset of the issue isn't solved, or doesn't seem to be. But it seems that the transformative nature of creating the model essentially divorces the end product from the original work nearly entirely. If the book could be readily retrieved AND that were the primary purpose and goal of the technology, I could see my way to an agreement there. But just like the rulings in US cases relating to home recording of video and audio tapes, if a technology has a legitimate, legal purpose, it is not in the interests of the state or the people to restrict its use to please the specific interests of people who might be tangentially aggrieved by some part of it (in this case, the somewhat faulty idea that the specific relative token weights of the words in the books are being used as a means of stealing some protectable asset).

That argument is somewhat tangential, but I'd say there's not a whole lot of meat on the bone for authors as far as America is concerned. I could be wrong; it's not settled by any stretch, and I understand Mosaic wanting to avoid the legal costs of, in my opinion, misinformed authors who believe the method of operation of these models is at once unknowable in its divinations and clearly blatant in its workings as a device of perceived theft.

Moreover, their lack of understanding of the accessibility, the lack of necessary transparency, and the ease of making LoRAs and finetunes against very specific datasets underlies the entire process. There is, and will forever be, no practical way of knowing what data any given model was trained on, and hyper-specific writing-style models can be trained in hours or days on a corpus of works without informing anyone, then deleted as soon as the needed text is generated. There will never be a way to watermark ASCII characters, and there will never be a way to obfuscate language against machine readability without also obfuscating it from human readability. It's a lost fight, in essence.

The genie is out of the bottle and is unlikely to return even under pressure from governments, who would be grossly unable to govern this because these models do not need to exist inside large corporate enterprises with budgets and head counts. They are relatively small, easily creatable and transferable files. And all of this is on top of the already, to me, unimpressive legal argument that a series of float values constitutes the legally protectable version of their written words. And again, I say this as an author and professional writer.

Anyway, please do link me to some references for those factors for Fair Use in international jurisdictions as they sound very specifically relevant to Machine Learning applications in their phrasing. No one in my family is a lawyer to the best of my knowledge at this time.

Edit: Also, apologies, I'll be done here. Not really the spot for this discussion, I'm sure, but I was having fun.

@gozfarb These are not standards of Fair Use, but places where you need Fair Use (or another exception) if you don't have a license.

Pretty much every paper that discusses Copyright in ML covers these three parts of the training process.

I see the license was switched back to Apache 2.0
