AI Decipher File · 3 November 2022 (filing) through 2024 (most claims dismissed; breach-of-contract claim survived; settlement of certain claims)
Doe v. GitHub Copilot November 2022: When the Open-Source Code Licensing Question Met an AI Coding Assistant
On 3 November 2022 a class-action lawsuit was filed in the United States District Court for the Northern District of California against GitHub, Microsoft, and OpenAI on behalf of anonymous open-source software developers (Doe 1 through Doe N). The complaint alleged that GitHub Copilot was trained on public GitHub repositories, including substantial open-source code under licenses (GPL, MIT, Apache 2.0, others) that require attribution and license preservation, and that it produced code completions reproducing training code without attribution, in violation of those licenses. The case is the foundational AI-coding-assistant training-data lawsuit. It proceeded through 2023-2024 with most claims dismissed; the breach-of-contract claim survived 2024 motion practice.
Failure pattern
AI coding assistant trained on open-source code (under permissive and copyleft licenses alike) whose attribution requirements the assistant's outputs did not preserve
Organizations involved
Anonymous plaintiff class (Doe 1-N) of open-source developers, GitHub, Inc., Microsoft Corporation, OpenAI, United States District Court for the Northern District of California
Incident summary
On 3 November 2022 a class-action lawsuit was filed in the United States District Court for the Northern District of California (Doe v. GitHub Inc., Case No. 4:22-cv-06823) against GitHub, Microsoft, and OpenAI. The complaint was brought by anonymous open-source software developer plaintiffs (Doe 1 through Doe N), who were represented by the Joseph Saveri Law Firm with attorney Matthew Butterick. It alleged that GitHub Copilot was trained on public GitHub repositories, including substantial open-source code under licenses (GPL, MIT, Apache 2.0, BSD, others) that require attribution and license preservation.
The complaint alleged that Copilot's code completions reproduced training-code patterns without preserving the attribution and license information the original code required. Claims included violation of the Digital Millennium Copyright Act (DMCA) Section 1202 (which prohibits removing or altering copyright management information), fraud, tortious interference with open-source contributor agreements, breach of contract with the GitHub Terms of Service, California unfair competition, and unjust enrichment.
The case proceeded through 2023 and 2024 with motions to dismiss and amended complaints. Most claims (including DMCA, fraud, tortious interference, and California unfair competition) were dismissed across the docket; the breach-of-contract claim (specifically against GitHub for alleged violation of GitHub's own Terms of Service obligations to repository owners) survived 2024 motion practice. Certain claims and parties have been the subject of partial settlement; the broader question of AI-coding-assistant training-data legal exposure remains substantially open.
Failure technique
The legal-technical pattern is the reuse of open-source-licensed code in foundation-model training where the model's outputs reproduce code patterns without preserving the attribution and license requirements the original code carried. Permissive open-source licenses (MIT, BSD, Apache 2.0) place few restrictions on reuse but still require attribution and preservation of the license text; copyleft licenses (GPL variants) carry stronger requirements, including that derivative works be distributed under the same license.
The technical question is whether the model's output (a code completion) is a derivative work of the training code (which would require license compliance), an independent expression (which would not), or something in between. Per the US Copyright Office's pre-publication report on Generative AI Training (Part 3 of its Copyright and Artificial Intelligence series, released 9 May 2025), the legal questions remain open across multiple dimensions. The specific case-law trajectory of Doe v. GitHub is foundational for the coding-assistant subcategory.
GitHub Copilot's product has continued evolving through the litigation period. GitHub added a duplication-detection filter (released 2023) that suppresses code completions that match training code above a threshold. The filter is GitHub's operational response to the memorization concern; the legal-defense scope is broader than the filter alone.
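The idea of suppressing completions that match known code above a threshold can be illustrated with a minimal sketch. The sources cited here do not describe GitHub's implementation, so everything below is an assumption: the token-window matching approach, the `DuplicationFilter` class, the default `window` of 8 tokens, and the in-memory index are hypothetical choices for illustration, not Copilot's actual design.

```python
import re
from typing import Iterable, List, Set, Tuple

# Identifiers, integer literals, or any single non-whitespace character.
# Whitespace is dropped, so reformatting alone does not defeat a match.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S")

Shingle = Tuple[str, ...]


def tokenize(code: str) -> List[str]:
    """Split source text into coarse tokens, ignoring whitespace."""
    return TOKEN_RE.findall(code)


def shingles(tokens: List[str], n: int) -> Set[Shingle]:
    """Return every run of n consecutive tokens (empty if too short)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


class DuplicationFilter:
    """Hypothetical filter: flags a completion that shares any
    `window`-token run with the reference corpus."""

    def __init__(self, reference_snippets: Iterable[str], window: int = 8) -> None:
        self.window = window  # assumed threshold, not a published value
        self.index: Set[Shingle] = set()
        for snippet in reference_snippets:
            self.index |= shingles(tokenize(snippet), window)

    def should_suppress(self, completion: str) -> bool:
        candidate = shingles(tokenize(completion), self.window)
        return not candidate.isdisjoint(self.index)
```

In this sketch, a filter built from `"def add(a, b):\n    return a + b\n"` would flag the reformatted completion `"def add(a,b): return a+b"` but not a completion shorter than the window. A production system would need a far larger index (for example hashed shingles in an external store) and a threshold chosen with counsel; those choices are outside what the source describes.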
Impact and consequences
Direct commercial impact on GitHub, Microsoft, and OpenAI has been operationally manageable but legally distracting. GitHub Copilot has continued growing substantially through the litigation period (the one-million-paying-subscribers milestone in October 2024 followed the litigation start by nearly two years). The legal cost and operational distraction are a steady-state expense rather than an existential issue.
Industry impact: the case is the foundational AI-coding-assistant training-data litigation. Subsequent AI coding assistants (Cursor, Codeium / Windsurf, Sourcegraph Cody, Tabnine, Replit Ghostwriter / AI Agent) have all addressed training-data sourcing, attribution-preserving outputs, and duplication-detection more explicitly than 2022-era products did. Per-product behavior varies substantially; the legal-defense surface has converged toward (1) attribution preservation in code-block outputs where appropriate, (2) duplication-detection filters, and (3) explicit user-facing licensing terms.
Open-source community impact: the Doe v. GitHub case is among the most-cited 2022-2024 legal events in the open-source community's conversation about AI training on open-source code. The Open Source Initiative, the Free Software Foundation, and Creative Commons have all engaged with the question. The conversation feeds into subsequent discussions of permissive-license updates and open-source AI training-data norms.
Lessons for builders
Treat open-source-code license terms as binding when training AI coding assistants on the code. Permissive licenses are not 'no license' — they impose attribution and license-preservation obligations. AI Strategy Lead owns the regulatory-engagement posture; AI Engineer owns the operational implementation of attribution-preserving outputs.
Build duplication-detection filters as a baseline feature for AI coding assistants. GitHub Copilot's duplication-detection filter is the operational model; in the wake of the Doe litigation, such a filter is a defensive expectation rather than a differentiating feature.
Distinguish copilot/completion output from generated-code-block output in attribution handling. A short tab-completion that matches a common pattern is different from a multi-line code block that reproduces a specific algorithm with structure; the attribution-preservation obligation scales accordingly.
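One way to picture this distinction is a small routing function that treats short completions and multi-line blocks differently. This is a sketch under stated assumptions, not a legal standard: the `SourceMatch` record, the `attribution_action` and `render` helpers, the four-line cut-off, and the comment-header format are all hypothetical, and where the real line falls is a judgment the source leaves open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceMatch:
    """Hypothetical record of where a matched snippet came from."""
    source_url: str
    license_id: str  # an SPDX identifier such as "MIT"


def attribution_action(output: str, match: Optional[SourceMatch],
                       block_min_lines: int = 4) -> str:
    """Decide how to handle a model output.

    Unmatched output and short matched completions are emitted as-is;
    matched multi-line blocks are emitted with an attribution header.
    The line threshold is an illustrative assumption.
    """
    if match is None:
        return "emit"
    non_blank_lines = [line for line in output.splitlines() if line.strip()]
    if len(non_blank_lines) < block_min_lines:
        return "emit"
    return "emit_with_attribution"


def render(output: str, match: Optional[SourceMatch],
           block_min_lines: int = 4, comment_prefix: str = "#") -> str:
    """Prepend a one-line attribution comment when the policy asks for it."""
    if match is not None and attribution_action(
            output, match, block_min_lines) == "emit_with_attribution":
        header = (f"{comment_prefix} Adapted from {match.source_url} "
                  f"(license: {match.license_id})")
        return header + "\n" + output
    return output
```

A stricter policy could return a third action that suppresses matched blocks under copyleft licenses entirely; the two-tier version is kept here only to mirror the completion-versus-block distinction in the text.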
Engage with the open-source community deliberately. The Doe v. GitHub case grew out of the open-source community's response to Copilot's launch; subsequent AI coding assistants have explicitly engaged with open-source community concerns to manage the legal and reputational surface.
Mitigations
What builders should put in place to address the failure pattern. Each mitigation maps to operational practice the relevant Applied AI roles own.
- Treat open-source-code license terms as binding when training AI coding assistants on the code.
- Build a duplication-detection filter as a baseline feature for AI coding assistants; GitHub Copilot's filter is the operational model.
- Distinguish short completion output from generated-code-block output in attribution handling; the attribution-preservation obligation scales with output structure.
- Document training-corpus composition and licensing posture for open-source code sources.
- Engage with the open-source community (OSI, FSF, Creative Commons) on training-data and attribution concerns.
- Offer customer-facing indemnification (GitHub Copilot's IP indemnification commitment in 2023 is one model) as a commercial response to the training-data IP risk.
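The mitigation about documenting training-corpus composition and licensing posture could start from something as simple as a per-repository manifest summarized by license family. The sketch below is illustrative only: the `CorpusEntry` record, the `license_family` and `summarize` helpers, and the short permissive and copyleft sets are assumptions, and a real manifest would cover far more SPDX identifiers and per-file license detection.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Small illustrative subsets of SPDX license identifiers, not a complete list.
PERMISSIVE = frozenset({"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"})
COPYLEFT = frozenset({"GPL-2.0-only", "GPL-3.0-only", "AGPL-3.0-only"})


@dataclass(frozen=True)
class CorpusEntry:
    """Hypothetical manifest row for one training-corpus repository."""
    repo: str
    license_id: Optional[str]  # SPDX identifier, or None if none was detected


def license_family(license_id: Optional[str]) -> str:
    """Bucket a license into a coarse family for reporting."""
    if license_id in PERMISSIVE:
        return "permissive"
    if license_id in COPYLEFT:
        return "copyleft"
    return "unclassified"  # unknown, missing, or not yet mapped


def summarize(entries: Iterable[CorpusEntry]) -> Dict[str, int]:
    """Count repositories per license family."""
    return dict(Counter(license_family(e.license_id) for e in entries))
```

Keeping an explicit "unclassified" bucket is a deliberate choice in this sketch: repositories with no detected license are surfaced for review instead of being silently treated as permissive.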
Related Applied AI roles
The Applied AI roles whose day-to-day work would have prevented, detected, or contained this incident.
- AI Strategy Lead: An AI Strategy Lead owns organizational AI strategy and prioritization at the company level.
- AI Engineer: An AI Engineer builds production cybersecurity-relevant AI systems integrating LLMs, embeddings, and retrieval pipelines.
- Senior AI Product Manager: A Senior AI Product Manager owns AI product strategy across multiple feature areas.
- AI Product Manager: An AI Product Manager owns AI-powered product features and the roadmap that ships them.
Companies central to this incident
Read the DecipherU Applied AI company profiles for the organizations whose decisions, products, or research shaped this incident.
- GitHub Copilot (Microsoft / GitHub): AI pair-programmer integrated into GitHub, VS Code, JetBrains, and CLI workflows
- OpenAI: Frontier large language models and consumer + API AI products
Related AI Decipher Files
- New York Times v. OpenAI (Dec 2023): The Copyright Case That Defines AI Training Liability
- Authors Guild v. OpenAI September 2023: When the Major Book-Author Class Action Joined the Generative-AI Training-Data Copyright Docket
- Stability AI v. Getty Images February 2023: When Image Generators Faced Their First Major Training-Data Copyright Lawsuit
Frequently asked questions
What did Doe v. GitHub allege?
Per the complaint (Doe v. GitHub Inc., Case No. 4:22-cv-06823, N.D. Cal., 3 November 2022), the anonymous open-source developer plaintiffs alleged that GitHub Copilot was trained on public GitHub repositories including open-source code under licenses (GPL, MIT, Apache 2.0, BSD) that require attribution and license preservation, and that Copilot's code completions reproduced training-code patterns without preserving the attribution and license information the original code required.
What is the current status of the case?
Through 2023-2024, most claims (DMCA Section 1202, fraud, tortious interference, California unfair competition) were dismissed across the docket. The breach-of-contract claim against GitHub for alleged violation of its own Terms of Service obligations to repository owners survived 2024 motion practice. Certain claims and parties have been the subject of partial settlement; the broader question of AI-coding-assistant training-data legal exposure remains substantially open.
How has GitHub Copilot evolved during the litigation?
GitHub added a duplication-detection filter in 2023 that suppresses code completions that match training code above a threshold. The filter is GitHub's operational response to the memorization concern. Per October 2024 GitHub statements, Copilot crossed 1 million paying subscribers during the litigation period.
What does the case teach AI coding assistant builders?
Treat open-source code license terms as binding when training AI coding assistants on the code; permissive licenses impose attribution and license-preservation obligations. Build duplication-detection filters as a baseline feature. Distinguish completion output from generated-code-block output in attribution handling. Engage with the open-source community deliberately on training-data and attribution concerns.
Which Applied AI roles work on AI coding assistant training-data IP?
AI Strategy Lead owns the regulatory-engagement and external-counsel posture. Senior AI Product Manager and AI Product Manager own the product-level decisions about attribution-preservation and duplication-detection features. AI Engineer owns the operational implementation of the duplication-detection filter and attribution-preserving output pipelines.
Sources
- Doe v. GitHub Inc., Case No. 4:22-cv-06823 (N.D. Cal., filed 3 November 2022) — Complaint and Amended Complaints
- Joseph Saveri Law Firm + Matthew Butterick, GitHub Copilot litigation hub (githubcopilotlitigation.com)
- Matthew Butterick, "GitHub Copilot litigation" (Butterick's public writing on the case background)
- GitHub, "GitHub Copilot litigation update" (GitHub corporate response)
- United States Copyright Office, "Copyright and Artificial Intelligence, Part 3: Generative AI Training" (Pre-Publication Report, 9 May 2025)
DecipherU is not affiliated with, endorsed by, or sponsored by any company listed in this directory. Information compiled from publicly available sources for educational purposes.