aimode.news
CS336: Language Modeling from Scratch

Logistics

- Lectures: Monday/Wednesday 15:00–16:20 in the Skilling Auditorium

- Recordings: YouTube playlist

- Office hours:

  - Percy Liang: Friday 11–12 in Gates 366

  - Tatsu Hashimoto: Tuesday 11–12 in Gates 364

  - Marcel Rød: Tuesday 16:30–17:30 in Gates 498, Wednesday 16:30–17:30 in Gates 415

  - Herman Brunborg: Wednesday 13:30–14:30 and Friday 13:30–14:30 in Gates 392

  - Steven Cao: Monday 16:30–17:30 and Thursday 9:30–10:30 in Gates 200

- Contact: Students should ask all course-related questions in public Slack channels. All announcements are also made in Slack. For personal matters, email cs336-spr2526-staff@lists.stanford.edu.

Content

What is this course about?

Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm in which a single general-purpose system addresses a range of downstream tasks. As the fields of artificial intelligence (AI), machine learning (ML), and NLP continue to grow, a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to give students a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that build an entire operating system from scratch, we guide students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.

Prerequisites

- Proficiency in Python

  Most of the class assignments will be in Python. Unlike most other AI classes, students will be given minimal scaffolding. The amount of code you will write will be at least an order of magnitude greater than for other classes. Therefore, proficiency in Python and software engineering is paramount.

- Experience with deep learning and systems optimization

  A significant part of the course will involve making neural language models run quickly and efficiently on GPUs across multiple machines. We expect students to be familiar with PyTorch and to know basic systems concepts such as the memory hierarchy.

- College calculus, linear algebra (e.g. MATH 51, CME 100)

  You should be familiar with matrix/vector notation and operations.

- Basic probability theory and statistics (e.g. CS 109 or equivalent)

  You should know the basics of probabilities, Gaussian distributions, mean, standard deviation, etc.

- Machine learning (e.g. CS221, CS229, CS230, CS124, CS224N)

  You should be familiar with the basics of machine learning and deep learning.

Note that this is a 5-unit class. It is very implementation-heavy, so please plan enough time for it.

Coursework

Assignments

- Assignment 1: Basics

  - Implement all of the components (tokenizer, model architecture, optimizer) needed to train a standard Transformer language model.
  - Train a minimal language model.

- Assignment 2: Systems

  - Profile and benchmark the model and layers from Assignment 1 using advanced tools, and optimize attention with your own Triton implementation of FlashAttention2.
  - Build a memory-efficient, distributed version of the Assignment 1 model training code.

- Assignment 3: Scaling

  - Understand the function of each component of the Transformer.
  - Query a training API to fit a scaling law and project model scaling.

- Assignment 4: Data

  - Convert raw Common Crawl dumps into usable pre-training data.
  - Perform filtering and deduplication to improve model performance.

- Assignment 5: Alignment and Reasoning RL

  - Apply supervised fine-tuning and reinforcement learning to train LMs to reason when solving math problems.
  - Optional part 2: implement and apply safety alignment methods such as DPO.

GPU compute for self-study

If you are following along at home, you can use GPU compute from a cloud provider to do the assignments. Here are some options (public prices for a single B200 GPU as of March 28, 2026):

- Modal (sponsor): $6.25/hour. Offers $30 of free compute per month. You are billed only for the compute you actually use (no idle resources), and their UX makes it easy to switch between local development and large GPU experiments. (Modal prices)

- Lambda Labs: $6.69/hour (Lambda prices)

- RunPod: $4.99/hour (RunPod prices)

- Nebius: $5.50/hour ($3.05/hour on call) (Nebius prices)

- Together: $7.49/hour, minimum of 8 GPUs, cheaper for longer commitments (Together prices)
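To compare these options for a particular run, a rough estimate can be sketched as below. This is only an illustration: the function name `estimate_cost`, the 10-hour run length, and the assumption that Modal's full monthly credit is still unused are all hypothetical, and the estimate ignores storage, data transfer, and any discounts.

```python
# Hourly B200 prices copied from the list above (USD per GPU-hour).
PRICES = {
    "Modal": 6.25,
    "Lambda Labs": 6.69,
    "RunPod": 4.99,
    "Nebius": 5.50,
    "Together": 7.49,
}
# Minimum number of GPUs that must be rented at once (from the list above).
MIN_GPUS = {"Together": 8}
# Free monthly credit in USD (from the list above).
# Assumption: the whole credit is still available for this run.
FREE_CREDIT = {"Modal": 30.0}


def estimate_cost(provider: str, hours: float, num_gpus: int = 1) -> float:
    """Rough USD cost of running `num_gpus` GPUs for `hours` hours."""
    # A provider with a minimum GPU count bills at least that many GPUs.
    billed_gpus = max(num_gpus, MIN_GPUS.get(provider, 1))
    gross = PRICES[provider] * hours * billed_gpus
    # Free credit can reduce the bill to zero, but never below it.
    return max(0.0, gross - FREE_CREDIT.get(provider, 0.0))


if __name__ == "__main__":
    run_hours = 10.0  # hypothetical run length, not a course estimate
    # Print providers from cheapest to most expensive for this run.
    for name in sorted(PRICES, key=lambda p: estimate_cost(p, run_hours)):
        print(f"{name:12s} ${estimate_cost(name, run_hours):8.2f}")
```

Note how a minimum GPU count dominates the cost of short single-GPU runs, while a free credit matters mostly for small experiments.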

For convenience and to save money, we recommend debugging the correctness of your implementation on CPU, and then using GPU(s) (with the number recommended in the assignments) to complete training runs (A1, A4, A5) or to benchmark GPU operations (A2).

Honor code

Like all other classes at Stanford, we take the student Honor Code seriously. Please note the following guidelines:

- Collaboration: Study groups are allowed, but students must understand and complete their own assignments, and hand in one assignment per student. If you worked in a group, please put the names of the members of your study group at the top of your assignment. Please ask us if you have any questions about the collaboration policy.

- AI tools: Prompting LLMs such as ChatGPT is permitted for low-level programming questions or high-level conceptual questions about language models, but using them directly to solve the problem is prohibited. We strongly encourage you to disable AI autocomplete (e.g. Cursor Tab, GitHub Copilot) in your IDE when completing assignments (though non-AI autocomplete, e.g. autocompleting function names, is totally fine). We have found that AI autocomplete makes it much harder to engage deeply with the content. See the AI policy (inspired by it).

- Existing code: Implementations of many of the things you will implement are available online. The handouts we provide are self-contained, so you will not need to consult third-party code to produce your own implementation. Therefore, you should not look at any existing code unless the handouts specify otherwise.

Submitting coursework

- All coursework is submitted via Gradescope by the deadline. Do not email us your coursework.

- If anything goes wrong, please ask a question in Slack or contact a course assistant.

- You can submit as many times as you like until the deadline: we only grade the last submission.

- Partial work is better than not submitting any work.

Late days

- Each student has 6 late days to use. A late day extends the deadline by 24 hours.

- You can use up to 3 late days per assignment.

Regrade requests

If you believe that the course staff made an objective error in grading, you may submit a regrade request on Gradescope within 3 days after the grades are released.

Sponsor

We would like to thank Modal for sponsoring the compute for this course.

Schedule (YouTube playlist)

| # | Date | Description | Course materials | Deadlines |
|---|------|-------------|------------------|-----------|
| 1 | Mon, March 30 | Overview, tokenization [Percy] | Lecture 01.py | Assignment 1 out [Code] [Preview] |
| 2 | Wed, April 1 | PyTorch (einops), resource accounting (FLOPs, memory, arithmetic intensity) [Percy] | Lecture 02.py (record version) | |
| 3 | Mon, April 6 | Architectures, hyperparameters [Tatsu] | Lecture 3.pdf | |
| 4 | Wed, April 8 | Attention alternatives and mixture of experts [Tatsu] | Lecture 4.pdf | |
| 5 | Mon, April 13 | TPUs [Tatsu] | Lecture 5.pdf | |
| 6 | Wed, April 15 | Kernels, Triton [Percy] | Lecture 06.py | Assignment 1 due; Assignment 2 out [Code] [Preview] |
| 7 | Mon, April 20 | Parallelism [Percy] | Lecture 07.py | |
| 8 | Wed, April 22 | Parallelism [Tatsu] | Lecture 08.pdf | |
| 9 | Mon, April 27 | Scaling laws [Tatsu] | Lecture 09.pdf | |
| 10 | Wed, April 29 | Inference [Percy] | Lecture 10.py | Assignment 2 due; Assignment 3 out [Code] [Preview] |
| 11 | Mon, May 4 | Scaling laws [Tatsu] | Lecture 11.pdf | |
| 12 | Wed, May 6 | Evaluation [Percy] | Lecture 12.py | Assignment 3 due; Assignment 4 out [Code] [Preview] |
| 13 | Mon, May 11 | Data (sources, datasets) [Percy] | Lecture 13.py | |
| 14 | Wed, May 13 | Data (filtering, deduplication, mixing, synthetic data) [Percy] | Lecture 14.py | |
| 15 | Mon, May 18 | Mid/post-training (SFT/RLHF) [Tatsu] | Lecture 15.pdf | |
| 16 | Wed, May 20 | Post-training – RLVR [Tatsu] | Lecture 16.pdf | Assignment 4 due; Assignment 5 out [Code] [Preview] [Optional part 2] |
| | Mon, May 25 | No class (Memorial Day) | | |
| 17 | Wed, May 27 | Alignment – multimodality [Percy] | Lecture 17.py | |
| 18 | Mon, June 1 | Guest lecture: Daniel Selsam | | |
| 19 | Wed, June 3 | Guest lecture: Dan Fu | | Assignment 5 due |

![CS336: Language modeling from zero](https://jjhwftqjccwqwubkfvke.supabase.co/storage/v1/object/public/images/articles/cs336-language-modeling-from-scratch.jpg)
