Creating Flashcards with a Reasoning LLM
How I made an awesome Anki deck for studying alpine skiing terminology without writing a single line of code.
So, my father is a ski instructor, and he’s preparing for one of the high-level PSIA exams. To prepare, he asked me for help creating a study aid for the jargon used at this advanced level of skiing. These terms aren’t familiar to most skiers, only to professionals and high-level instructors. But knowing this jargon is essential for reading ski literature, holding your own in conversations among ski instructors, and truly understanding the theory behind the practice. This kind of domain-specific vocabulary is key.
Given my background in language learning, I was more than happy to take on the challenge. This afternoon, I sat down and, in just a couple of hours, managed to turn the glossary from one of his books into what I feel are highly effective flashcards. These flashcards are designed with solid psychological principles to optimize the learning process. And the best part? I did it all without writing a single line of code or manually typing out any of the 300-plus cards.
Scanning and Extracting the Glossary
Here, I want to share my exact process. First, I used an app called TurboScan to take pictures of the 12-page glossary at the back of my dad’s book. TurboScan creates nice, high-contrast images, but some of the pages had blemishes that would have caused traditional OCR solutions to produce a lot of character recognition errors.
To avoid this, I went page by page and pasted each image into ChatGPT 4o, which now has vision capabilities, and asked it to convert each page into JSON. I quickly learned that I had to specify not to use Python, because otherwise it would treat the image as data and write a Python program that loads it into an OCR library like Tesseract. And Tesseract is pretty terrible compared to the built-in OCR capabilities of the GPT-4o vision model.
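The prompt itself was simple. I don’t have the exact wording saved, but it was along these lines:
Transcribe this glossary page into JSON, mapping each term to its definition. Do not use Python or any OCR library; read the text directly from the image.
And each page came back as a clean block of JSON:
{
  "abduction": "Movement of a limb away from the body's midline.",
  ...
}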
Now, there’s some debate (and, from what I’ve read online, a bit of controversy) about how the OCR in ChatGPT 4o with vision actually works. Some people suspect that it supplements the neural network with some kind of traditional OCR technology. Regardless, the output is clearly passed through the LLM, which smooths over what would normally be transcription errors. It can even infer what a word or character should be when there’s a blemish over it. I certainly didn’t catch it making any errors.
Crafting Creative Clozes
Next, I opened a new GPT session, this time using the o1 model. Here, I pasted in the JSON data one page at a time and asked the o1 model to generate flashcards. This is where I applied some of my domain-specific knowledge.
What I discovered was that if I simply asked o1 to generate cloze deletion flashcards—cloze deletion being a fancy term for fill-in-the-blank cards, where there’s a blank in the sentence on the front and the word on the back—it would tend to place the blank near the beginning of the sentence. From personal experience and a pet theory of mine (which I might have picked up from Foundations of Statistical Natural Language Processing by Manning and Schütze, a book I regard as much about the mind as about NLP), I know it’s far better to position the blank at the end—or near the end—of the sentence.
Here’s why: each flashcard is like a training session for the neural network in your brain. What you’re trying to do is get your brain into a specific state of semantic resonance based on the preceding words and then attempt to recall the key word. You want to teach your brain that, when it’s in that state, it should anticipate a specific term as the next logical step.
Let’s take an example: the word abduction. If the flashcard sentence is something like, ___ is when the skier slides their ski outward, the blank appears almost immediately. This forces the reader to awkwardly skip over it, read the rest of the sentence, and then reason backward to figure out what fits. But if we place the blank at the end, as in, When the skier slid her ski outward, she was demonstrating ___, the reading process naturally puts the brain into a state where that word is the next logical token. This setup effectively trains the brain on next-token prediction, much like how large language models are trained.
Think carefully about the principles of excellent cloze deletion flash cards for word learning (monolingual). We want to optimize mental training for in-context _use_ of the words, rather than training regurgitation of definitions. You will appreciate that this training works best when the blank is near the end of the example sentence. Then return a list with flashcards for each word in the given list of definitions. Use JSON format for your reply. { "abduction": "Movement of a limb away from the body's midline.", ...
In the prompt shared above, you can see I specifically instructed o1 to place the blank near the end of the sentence. o1 excels at generating really good, highly specific example sentences that make for effective flashcards. In the past, I’ve tried generating these kinds of cloze deletions with the 4o model, but I found that the results were often too ambiguous or generic. The output from o1, by contrast, is spot on.
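For a sense of the output: the cards came back as JSON too. I’m reconstructing the schema here rather than pasting my actual session, but each entry paired a term with its cloze sentence, something like:
{
  "abduction": "When the skier slid her ski outward, she was demonstrating ___.",
  ...
}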
Generating an Anki Deck (with great audio)
Now, I needed to write a Python script to transform the JSON output I had from the o1 model, complete with all the example sentences, into an Anki deck. Anki supports CSV imports and can also play media files, and I believe flashcards work best when they incorporate both written and spoken cues. So, I prompted o1 to generate exactly such a program.
Here, I applied some of my past experience. I’m not sure how o1 would have approached architecting this kind of program if I’d left it entirely up to its discretion, but I was quite specific in my instructions. From experience, I know that when creating a program like this, it’s common to encounter interruptions or crashes partway through. Losing all of your text-to-speech data in such cases can be a real frustration, so I made sure the script would include a temporary directory where all the audio files would be stored as MP3s. This way, the program could cache those files and reuse them instead of regenerating them every time.
I have a file "ski-terms.json" in this format: {... <i pasted a sample here>}
Write a Python program that (1) creates a tmp directory in the project dir, (2) for each of the terms and example sentences, use the OpenAI voice model to generate an MP3 file which you should store in the tmp dir. For the filename, convert the term to a filename-safe string (removing parentheticals, strip spaces, spaces-to-underscores, etc.) and prefix `term_` or `cloze_`, (3) do not re-do the expensive TTS if the output files already exist, (4) save to a CSV file in a format appropriate for importing into Anki as a deck.
For the text-to-speech, I asked o1 to use the OpenAI voice model. At first, it didn’t seem aware of that model (likely because the model wasn’t part of its training data), so I provided it with example code from the OpenAI website. After that, o1 happily updated the script to incorporate the voice model.
Use this OpenAI api:
from pathlib import Path
from openai import OpenAI

client = OpenAI()

speech_file_path = Path(__file__).parent / "speech.mp3"
response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Today is a wonderful day to build something people love!",
)
response.stream_to_file(speech_file_path)
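Putting it all together, the script o1 produced looked roughly like the following. I’m sketching it from memory rather than pasting the verbatim output, and it assumes the simple term-to-sentence JSON schema shown earlier:

import csv
import json
import re
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# (1) A tmp directory in the project dir, used as a cache for the MP3s.
TMP_DIR = Path(__file__).parent / "tmp"
TMP_DIR.mkdir(exist_ok=True)


def safe_name(term: str) -> str:
    """Convert a term to a filename-safe string: drop parentheticals,
    strip surrounding spaces, and turn inner spaces into underscores."""
    term = re.sub(r"\([^)]*\)", "", term)
    return term.strip().replace(" ", "_")


def tts(text: str, path: Path) -> None:
    """(2)+(3) Generate speech for `text`, skipping the expensive TTS
    call entirely if the MP3 is already cached from a previous run."""
    if path.exists():
        return
    response = client.audio.speech.create(model="tts-1", voice="alloy", input=text)
    response.stream_to_file(path)


terms = json.loads(Path("ski-terms.json").read_text())

# (4) Write a CSV that Anki can import: front field, then back field.
# Anki plays audio referenced via [sound:...] tags, provided the MP3s
# are copied into the collection.media folder.
with open("ski-deck.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for term, cloze in terms.items():
        term_mp3 = TMP_DIR / f"term_{safe_name(term)}.mp3"
        cloze_mp3 = TMP_DIR / f"cloze_{safe_name(term)}.mp3"
        tts(term, term_mp3)
        tts(cloze, cloze_mp3)
        writer.writerow(
            [f"{cloze} [sound:{cloze_mp3.name}]", f"{term} [sound:{term_mp3.name}]"]
        )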
The resulting program worked well. However, I hadn’t anticipated one issue: underscores in the input text. The OpenAI TTS model handled underscores nicely when they appeared at the end of a sentence, sounding as though it was trailing off naturally. For example, in a sentence like This is known as a ___, it read the sentence exactly how you’d expect a human tutor to say it—trailing off slightly at the blank.
But when the underscores appeared mid-sentence, like in This is called the ___ zone, the TTS model struggled. It didn’t intuitively handle the pause or emphasis correctly, making it sound awkward. To fix this, I went back to the o1 model and instructed it to replace all underscores with the word BLANK before sending the text to the voice model.
OK, this is working, except that the TTS does not work well with the underscores. Please replace consecutive underscores with the word "BLANK" so it can be pronounced.
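In the script, that amounted to a one-line change just before the cloze audio is generated, along these lines (using the `tts` helper and `cloze` variable from the sketch above):

# Underscore runs become a pronounceable placeholder for the TTS call only;
# the CSV keeps the visual blank.
tts(re.sub(r"_+", "BLANK", cloze), cloze_mp3)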
To my delight, this adjustment worked perfectly. The voice model not only read the word BLANK clearly but also emphasized it in a way that sounded natural—just as a human tutor would when helping someone fill in a missing term.
(The answer is affective, btw.)
Importing to Anki and Testing it Out
The final step, and the part where I went from feeling like I was water skiing to just walking again, was importing the generated CSV file into Anki and configuring the app to display the front and back of the cards correctly. Ultimately, this was very straightforward and felt more like clerical work than anything technical. I didn’t need to do any coding, though Anki’s desktop interface was a bit wonky to figure out at first.
That said, the import worked seamlessly. The audio sounds great, and even with 300 cards, the deck was only about 30 megabytes. From there, I emailed myself a Google Drive link, opened it in the iOS version of the Anki app, and everything worked perfectly. (The link is to a video of the macOS app, but you get the idea.)
Sadly, Hallucinations
Unfortunately, the model sometimes makes mistakes (what in the biz are called hallucinations). For example, even o1 really doesn’t have a solid understanding of physics.

Thankfully, these kinds of errors seem fairly rare in these cards… now I need to go through them all myself and try to spot the errors!
What a time to be a builder, with tools like these LLMs at our disposal. It occurs to me that, although LLMs may be getting rather good at writing software, they are also replacing software. I mean, rather than working with an LLM to build software to do X, we’re getting to the point where it’s easier just to ask the model to do X directly. This is sad for programmers, but great if you just want to get things done.