LearnLM shows why LLM tutors will get better
Google's LearnLM can be *automatically* evaluated by another LLM-based critic/judge that can be trained to evaluate according to Google's seven pedagogical benchmarks
Please note: This post is a review of Google’s technical report, introduced in their May 14 blog post.
Personalization is EdTech’s Holy Grail
Optimal pedagogy as a complex problem (manifold) in high-dimensional space
The first principles are to encourage active learning, manage cognitive load, deepen metacognition, stimulate curiosity, and adapt to learners [Sec 4.3.1]
Quality data is somewhat synthetic but Golden Conversations was up-weighted (2x) [Sec 3.4]
Benchmarks: AI improvement requires pedagogical measures
Summary
Personalization is EdTech’s Holy Grail
As an educator who built an online school based on my own personal experiences and biases about online learning1, I found Google’s technical report introducing their LearnLM-Tutor (“LearnLM”) an enlightening peek into the future of generative AI in education. Their LLM is a conversational tutor, and multi-turn conversations with a fine-tuned LLM obviously do not represent the whole sprawling field of education. Still, I’m impressed at how deeply they are operationalizing progress in tutor evaluation. The paper convinces me that progress here is not a narrow achievement; it’s wide and profound. It’s also a lesson in the power of carefully crafted metrics, aka benchmarks.
So the paper is a peek into the future of conversational tutoring’s inevitable improvement. Even today, my own online learning is mostly non-personalized: when taking a course, I view the same videos in sequence as everybody else who’s enrolled; I read the same chapters that are assigned in sequence; and I work on mostly the same Q&A exercises. For my own product, it was 17+ years ago that I introduced a customer support forum2 to socialize practice questions. I was thrilled by the idea that learners could follow up on selected topics at their discretion, maybe focusing on introductory material if they were struggling, or debugging harder problems if they were experienced. But a forum doesn’t scale personalized learning in the way that generative AI makes possible3.
For many years, the holy grail has been personalization. Human tutors don’t scale, but the claim is that a tuned LLM can scale the benefits of a great 1:1 learning experience. In his new book, Sal Khan asks us to imagine:
“What might it be like if every student on the planet had access to an artificially intelligent personal tutor: an AI capable of writing alongside the student; an AI that students could debate any topic with; an AI that fine-tuned a student’s inherent strengths and augmented any gaps in learning; an AI that engaged students in new and powerful ways of understanding science, technology, engineering, and mathematics; an AI that gave students new ways of experiencing art and unlocking their own creativity; an AI that allowed for students to engage with history and literature like never before?” — Khan, Salman. Brave New Words: How AI Will Revolutionize Education (and Why That's a Good Thing) (pp. 8-9). Penguin Publishing Group. Kindle Edition.
Optimal pedagogy as a complex problem (manifold) in high-dimensional space
If you are a skeptic who doesn’t believe that LLMs can ever genuinely scale the best features of human tutor-student (1:1) learning, you’ll appreciate the paper’s guilty plea to the job’s almost comical degree of difficulty. It’s Google, after all, so this hard problem—this inherent challenge—is portrayed as a cute topographical surface called a manifold. After too many minutes trying to literally interpret the function (what are the axes? what does the darkest shading represent?), I realized it’s more of a cool symbol.
The topographic symbol illustrates that learning (“pedagogical behavior”) has many dimensions, including domain type (math versus history), learner features (e.g., styles, motivation, prior knowledge), and instructional methods. But there are more than three axes, because “implementation parameters (e.g. the time spacing between practices, the ratio of examples to questions) … can be combined in different ways, resulting in a combinatorial explosion in possible, often context-dependent, pedagogical strategies”. The undulating peaks are meant to show that there are many good teaching strategies (aka, local pedagogical optima), but none is globally best and, further, most are hard to find because the space is large and we can’t extrapolate from one optimum to another. Yes, that makes complete sense to me. Learning is complex, and therefore so is teaching, I think.
The first principles are to encourage active learning, manage cognitive load, deepen metacognition, stimulate curiosity, and adapt to learners [Sec 4.3.1]
They confront this complex problem by articulating first principles, which are then operationalized into metrics. I cannot imagine everybody agreeing on this exact list of first principles, but—in my experience—once you agree on the principles, the usable metrics are often easy to select: you discover that some candidate metrics are hard to collect after all, while others are easy to collect and correlate with them anyway. Myself, I like their list. Their principles include the following (emphasis mine), with sample metrics in the sub-bullets:
Encourage active learning: Everybody seems to agree that “discussion, practice, and creation” are better than “passively absorbing information”.
Metric(s): Do not reveal the answer; guide towards the answer; promote active engagement
Manage cognitive load: The tutor should “present information in multiple modalities, structure it well, and segment it into manageable chunks.”
Metric(s): Stay on topic
Deepen metacognition: This is the “thinking about thinking” that enables learners to generalize to other domains.
Metric(s): Identify and address misconceptions
Motivate and stimulate curiosity
Metric(s): Communicate with positive tone; respond appropriately to explicit affect cues
Adapt to learners’ goals and needs: This includes current assessment and tailored planning
Metric(s): Adapt to the learners’ goals and needs; adapt to the learner’s level
Quality data is somewhat synthetic but Golden Conversations was up-weighted (2x) [Sec 3.4]
A theme in the paper is the difference between general large language models (LLMs) like Gemini or GPT-4o or Claude 3 Opus, and tuned LLM tutors. General LLMs are trained to be helpful but “helpfulness may often be at odds with pedagogy and learning” and the tuned LLM tutor has a “narrower conversational goal”. According to the authors, good datasets that can train tutors don’t yet exist: “only four datasets [are] openly available to our knowledge, all of which suffer from limitations, such as a lack of grounding information, low tutoring quality, small dataset size, and noisy classroom transcriptions. Furthermore, most human tutoring data is focused only on language learning.”4
For their supervised fine-tuning (SFT), they created the following five datasets; the signifiers M0 to M4 refer to stages of their tuned LLM tutor, such that M4 is the best LearnLM:
[Human data but only for M0-1, not for M4] Human tutoring: used only for the first two versions (M0 and M1) because it “has a number of limitations”, including uneven quality
[Synthetic data for M1-M4] Gen AI role-play with human intervention: The LLM plays both tutor and learner. Although the data is synthetic, the result is a “reasonably consistent (if sometimes stilted) pedagogical dialogue over very long conversations”.
[Synthetic data for M2-M4] GSM8k dialog: math word problems
[Human data for M3 and 2x for M4] Golden conversations: labor-intensive collaboration with human teachers to develop “a rubric that included a learning scenario or lesson as context, a minimal learner persona, and a set of [desired] behaviors to include”.
[90% for M3 and M4] Safety dataset5: “Our safety fine-tuning data consists of harm-inducing conversations and golden responses on lesson material across a wide range of subjects. Queries were either written by the team or taken from failures observed during automatic or human red-teaming.”
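To make the “up-weighted (2x)” idea concrete, here is a minimal sketch of how a dataset can be up-weighted in an SFT mixture by doubling its sampling weight. This is my own reconstruction of the general technique, not Google’s pipeline; all names and example strings are hypothetical.

```python
# Sketch of up-weighting one dataset in a fine-tuning mixture (my own
# illustration, not Google's implementation). Doubling the sampling weight
# of "golden" makes its examples twice as likely to be drawn per step.
import random

datasets = {
    "role_play": ["rp_1", "rp_2"],       # hypothetical example IDs
    "gsm8k_dialog": ["gsm_1", "gsm_2"],
    "golden": ["gold_1", "gold_2"],
}
weights = {"role_play": 1.0, "gsm8k_dialog": 1.0, "golden": 2.0}  # 2x up-weight

def sample_example(rng: random.Random) -> str:
    """Draw one training example: pick a dataset by weight, then an example."""
    names = list(datasets)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return rng.choice(datasets[name])
```

With these weights, roughly half of all sampled examples come from the golden set (2 out of a total weight of 4), even though it holds only a third of the examples.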
The GSM8K dataset is a large set of math word problems that “a bright middle school student should be able to solve,” according to their repo. For example, here’s the first one that I randomly sampled:
John had a son James when he was 19. James is now twice as old as his sister Dora, who will turn 12 in 3 years. How old will John's youngest son, who was born when John was 32, be in 3 years?
Both Claude-3-Opus and GPT-4o aced this question when I prompted them with it.
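For the record, the arithmetic works out as follows (my own step-by-step computation, not from the paper or the repo):

```python
# Worked solution to the sampled GSM8K problem above (my own arithmetic).
dora_now = 12 - 3             # Dora turns 12 in 3 years, so she is 9 now
james_now = 2 * dora_now      # James is twice Dora's age: 18
john_now = 19 + james_now     # John was 19 when James was born: 37
youngest_now = john_now - 32  # youngest son was born when John was 32: 5
answer = youngest_now + 3     # in 3 years, the youngest son will be 8
print(answer)  # 8
```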
As you can see, their supervised fine-tuning relied largely on synthetic data. Further, they believe that “long [aka, multi-turn] conversations are crucial.” In this section, the two most interesting insights to me are:
While human conversations (i.e., human tutor to human learner) are more helpful for modeling stylistic attributes (e.g., encouragement), synthetic conversations are better at modeling substantive gaps (e.g., identification and correction of errors).
Their itemization of so-called sociotechnical limitations of a text-based LLM tutor in comparison to a human tutor [Section G]. The human tutor’s specific advantages include:
Contextual Understanding: AI lacks the real-world physical and social context that humans inherently understand and draw on when communicating.
Personalization (notice how the same word is used in a different context): Human tutors know personal details about learners, like age and learning style, and build on past interactions, whereas AI struggles with gathering and effectively using such information.
Non-verbal Communication: Human tutors use facial expressions, body language, and tone to gauge and respond to learners' emotions, which AI systems in text-based environments cannot do.
Multi-modal Interaction: Human tutoring often involves using diagrams, objects, and writing together, but AI systems can't yet seamlessly interact across different media types.
Social Norms: Human tutors rely on social norms to guide learner behavior and teaching strategies, while AI may not effectively enforce or leverage these norms, leading learners to demand direct answers.
Benchmarks: AI improvement requires pedagogical measures
This is the paper’s chief contribution and achievement. For me, to read about their process and ponder the tradeoffs is to think about what learning means to me and how I would define it. They developed seven benchmarks:
Slow and expensive
Learner gives feedback via ratings on unguided conversations (5.1)
Teacher gives feedback via ratings on single prompt-reply exchanges (5.2)
Teacher gives feedback via ratings on multi-turn conversations (5.3)
Teacher gives feedback via rankings on multi-turn conversations (5.4)
Semi-structured interviews with learners in a real-world setting (ASU Study Hall program)
Fast and cheap
An LLM-based critic (aka, LLM judge) automatically evaluates (aka, scores) the LearnLM according to whether LearnLM achieves a pedagogical task. For example: Did LearnLM adapt to the learner’s level? Yes = 1, No = 0.
Comparison of LearnLM’s response to a human tutor’s: is a pedagogical reply (the reply a teacher whose goal is learning would give) more probable than a conversationally helpful reply?
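The binary Yes/No scoring in benchmark #6 can be sketched very simply. This is my own illustration of the idea, not the paper’s implementation; `ask_llm` is a hypothetical helper standing in for a call to a large judge model.

```python
# Sketch of an LLM-critic auto-evaluation loop (my illustration; the paper
# does not publish code). Each critic question targets one narrow
# pedagogical capability and is scored 1 (Yes) or 0 (No).

def ask_llm(prompt: str) -> str:
    # Placeholder: in practice this would call a large, capable judge model.
    return "Yes"

CRITIC_QUESTIONS = [
    "Did the tutor avoid revealing the answer outright?",
    "Did the tutor guide the learner towards the answer?",
    "Did the tutor adapt to the learner's level?",
]

def score_conversation(conversation: str) -> dict:
    """Score one tutoring conversation against each critic question."""
    scores = {}
    for question in CRITIC_QUESTIONS:
        prompt = (
            f"Conversation:\n{conversation}\n\n"
            f"Question: {question}\nAnswer Yes or No."
        )
        reply = ask_llm(prompt)
        scores[question] = 1 if reply.strip().lower().startswith("yes") else 0
    return scores

print(score_conversation("Learner: I'm stuck on 3x + 2 = 11 ..."))
```

Note how this mirrors the paper’s rationale quoted later: each critic evaluates one very specific capability, which is an easier job than generating good tutoring in the first place.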
Without going into methodological details—and there is much detail that is not in the paper—I’ll just share my summary interpretation:
In regard to the first two simpler evaluations (#1 and #2 above), LearnLM was statistically better than Gemini 1.0 in only two categories: it helped learners feel more confident, and teachers perceived that it was better at promoting engagement.
In regard to teachers’ rating feedback on multi-turn conversations (#3 above), LearnLM was preferred on all 25+ metric-attributes except for one.
In regard to teachers’ ranking feedback on multi-turn conversations (#4 above), LearnLM was better on 4 out of 5 categories (the fifth category was “which conversation is better in terms of the accuracy of statements made by the tutor?”; i.e., it was roughly a tie with respect to perceived accuracy).
In regard to the automatic LLM-based critic’s evaluation (#6 above)—which is arguably the most relevant finding in the paper—I’ve copied their Figure 10 below. The left exhibit shows a high correlation between the LLM-based critic’s evaluations and humans’ evaluations, which tends to validate the employment of the robots. The right exhibit shows that LearnLM was ~16% better (that’s just me using a naïve average ratio of the critic scores).
Still in regard to the automatic LLM-based critic’s evaluation (above), we can highlight three of the first principles:
Managing cognitive load: LearnLM was better; e.g., “Stay on topic” ratio of 1.45 = 0.87 / 0.6.
Deepen metacognition: LearnLM was better; e.g., “Point out misconception” ratio of 1.5
Encouraging active learning: LearnLM was better at “Guide towards answer” (ratio of 1.58) and “Promote active engagement” (ratio 1.20) but worse at “Don’t reveal answer” (ratio of 0.82)
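The ratio arithmetic above is easy to check. Note that the five ratios quoted in this section are only a subset of the paper’s metrics, so their average is not the ~16% paper-wide figure:

```python
# Checking the quoted "Stay on topic" ratio: LearnLM 0.87 vs. base 0.60.
learnlm_score, base_score = 0.87, 0.60
stay_on_topic_ratio = learnlm_score / base_score
print(round(stay_on_topic_ratio, 2))  # 1.45

# Naive average of just the five ratios cited in this section
# (1.45, 1.5, 1.58, 1.20, 0.82) -- a subset, not the full metric set.
ratios = [1.45, 1.5, 1.58, 1.20, 0.82]
print(round(sum(ratios) / len(ratios), 2))  # 1.31
```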
In my view, this LLM-based critic is the second link required to believe these tutors will inevitably improve: first, you have to sufficiently agree with the designed principles and their operationalized metrics. Second, you have to believe the LLM-based critic can successfully evaluate AI tutor performance against those benchmarks on a par with humans. The authors have confidence in this second link, as they say that evaluating AI tutors is easier than building AI tutors:
“While prompting gen AI to generate pedagogically valid tutor responses is hard, we find that prompting gen AI to evaluate pedagogical dimensions (for critique-based auto-evaluations) is more successful. This is partly because … we break down pedagogy into specific dimensions, so that each critic only needs to evaluate a very specific capability in response to a dataset of prompts targeted at eliciting that capability. Our LLM critics also get access to privileged information … Finally, we can leverage much larger and more capable LLMs for evaluations, which would not be feasible due to cost and latency considerations in a user-facing system.”
—Page 21
Summary
I don’t think the statistically significant evaluations are the highlight. The paper is the story of a multidisciplinary team collaborating to develop their first draft of seven pedagogical benchmarks that can be used to evaluate an AI tutor’s performance; and how an LLM-based critic was trained to automatically employ those benchmarks in evaluating AI tutors. That is, an LLM was trained to evaluate other fine-tuned LLM tutors. Sure, there’s a lot for humans still to do in there6, but it looks like a good recipe for improvement.
My school was inspired by lynda.com. Many years ago, when they were located in Ojai, I was overwhelmed by the power of learning Illustrator and Photoshop by watching an expert work, virtually over their shoulder, as they tutored the software. She had an excellent process for designing and deploying courses hosted by fascinating people. The product was infectious.
Hosting a forum can be a massive commitment of time and energy. It will be interesting to see how generative AI impacts forums. Reddit’s achievement is rare. Most forums are not highly scalable.
Note about red teaming (9.4)
I think. Or, I’m teasing. I don’t know.
I think you may be right. I was very excited to read about these developments. Did you read the report? It gets a little testy at times. At one point, the education field is described as recently entering the 20th century with its emphasis on evidence. Let's just say... not the best moment in the report.