questionnaire-tts as a ux project

questionnaire-tts started as a practical little builder: take a folder of questionnaire text files, generate neural text-to-speech audio for each item, and ship a static web page that can play everything back. looked at as a UX project, though, the interesting part is not the synthesis itself. the project is about a very ordinary failure point in psychological testing: a client can be handed a mountain of paperwork, but the results are only meaningful if they can actually read and understand the items.

that sounds obvious until you picture the evaluation room. a client may be tired, anxious, neurodivergent, undereducated, dyslexic, visually strained, or functionally illiterate. they may be too embarrassed to say they cannot read a form well enough. meanwhile, the clinician still has to get through pages and pages of inventories, rating scales, symptom checklists, and consent-adjacent paperwork. if the client guesses, skips, misreads negations, or answers based on partial comprehension, the results can be invalid even though the form looks complete.

the motivation was extremely concrete: i had to read the PAI, the Personality Assessment Inventory, out loud to clients twice in as many weeks. the PAI has 344 items. doing that manually once is tiring; doing it again immediately makes the inefficiency impossible to ignore. at that point the question becomes “what would it take to engineer a repeatable, less error-prone version of this?”

the input is intentionally boring: one question per line in a .txt file. the output is a local static site with mp3 assets, item numbering, previous/next controls, replay, and a questionnaire selection page. that simplicity matters because the software should not become another assessment task. it should make it easier to deliver questionnaire items consistently, at a humane pace, with fewer hidden assumptions about reading fluency.
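as a rough illustration of how little the input format asks for, here is a minimal sketch of reading a one-item-per-line file into numbered items. the function name load_items and the choice to skip blank lines are assumptions for the example, not something taken from the repo.

```python
from pathlib import Path


def load_items(path: Path) -> list[tuple[int, str]]:
    """Read one questionnaire item per line; return (one-based number, text) pairs."""
    items = []
    for line in path.read_text(encoding="utf-8").splitlines():
        text = line.strip()
        if not text:
            continue  # assumption: blank lines are spacing, not items
        items.append((len(items) + 1, text))  # number only the lines that count
    return items
```

numbering after the blank-line filter keeps the displayed item number aligned with what the client hears, even if the text file has stray empty lines.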

ux problem

psychological questionnaires are often designed around paper, pdfs, or form fields. that design quietly assumes literacy: the client can decode the words, track the scale, notice qualifiers like “often” or “not,” and keep going for dozens or hundreds of items. for many evaluations, that is a fragile assumption.

the risk is not just inconvenience. it is validity. if the item is not read correctly, the response is not cleanly interpretable. if a client is ashamed to ask for help, the clinician may never know. if the clinician reads items aloud ad hoc, delivery can vary by fatigue, tone, speed, interruption, or skipped wording. on top of that, whoever is administering the form has to keep track of several things at once:

where they are in the instrument
whether the current item has finished playing
how to repeat exact wording without making the client feel singled out
how to move backward or forward without losing the paper trail

the repo turns those into first-class interface states. the web player shows the current item number using human-friendly one-based indexing, keeps audio controls close to the text, and prevents accidental skipping while an item is still playing. the design goal is not to replace clinical judgment. it is to remove one avoidable source of measurement noise: uneven access to the words on the page.

interaction design

the core controls are deliberately redundant. there are visible buttons for mouse/touch users and keyboard shortcuts for clinicians or assistants moving quickly through a set:

left / right: previous and next item
space / R: replay current audio
Q: hidden administrator-only skip-ahead escape hatch

the skip behavior is intentionally asymmetric. the client-facing flow does not allow skipping ahead until the current item has been read completely, because premature skipping recreates the same validity problem the tool is trying to solve: a completed answer sheet where the client may not have heard the actual item. the hidden Q shortcut gives the administrator a way to jump forward if something breaks, gets duplicated, or needs to be bypassed for a legitimate reason, but it stays out of the visible interface so the client is not invited to race through the form.
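the asymmetric skip rule is easier to see as a small state model. the real player runs in the browser, so the sketch below is only a Python model of the logic under stated assumptions: the class and method names are hypothetical, the override moves one item forward, and an item that has already been heard stays unlocked when revisited.

```python
from dataclasses import dataclass, field


@dataclass
class PlayerState:
    total: int                                        # items in the questionnaire
    index: int = 0                                    # zero-based internally
    finished: set[int] = field(default_factory=set)   # items heard to the end

    @property
    def display_number(self) -> int:
        return self.index + 1                         # one-based number shown on screen

    def on_audio_ended(self) -> None:
        self.finished.add(self.index)                 # playback reached the end

    def next(self, admin_override: bool = False) -> bool:
        """Advance one item; return False if the move was refused."""
        if self.index >= self.total - 1:
            return False                              # already on the last item
        if self.index not in self.finished and not admin_override:
            return False                              # locked until the item is heard
        self.index += 1
        return True

    def previous(self) -> bool:
        """Going back is never locked."""
        if self.index == 0:
            return False
        self.index -= 1
        return True
```

in this model the right-arrow key would call next(), the hidden Q key would call next(admin_override=True), and the audio element's end-of-playback event would call on_audio_ended().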

the keyboard-first flow is the main UX win. the interface does not assume the operator wants to reach for a trackpad every time, and it does not require improvising aloud from a stack of forms. once you learn the rhythm, it becomes closer to a slide deck or media player than a folder of audio files: exact item, exact replay, clear position, and controlled progression.

the multi-questionnaire selection page is another small but important piece. instead of dumping every mp3 into one output directory, the builder creates individual questionnaire pages and a landing page with cards. each card can show the form name, question count, and estimated duration, so choosing the next form becomes a scan-and-select task rather than file management. in an evaluation context, that matters: the less the administrator has to hunt around, the more attention they can keep on the client.
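one way the card metadata could be derived at build time is sketched below. the speaking rate and per-item response time are assumed constants chosen for illustration, and Card and build_card are hypothetical names; the repo may well measure duration from the generated audio instead.

```python
from dataclasses import dataclass

WORDS_PER_MINUTE = 150   # assumed speaking rate, not measured from any voice
RESPONSE_SECONDS = 5     # assumed time the client needs to mark an answer


@dataclass
class Card:
    name: str
    question_count: int
    estimated_minutes: int


def build_card(name: str, items: list[str]) -> Card:
    """Summarize one questionnaire for the landing page."""
    words = sum(len(item.split()) for item in items)
    seconds = words / WORDS_PER_MINUTE * 60 + len(items) * RESPONSE_SECONDS
    return Card(name, len(items), max(1, round(seconds / 60)))
```

even a rough estimate like this is enough for the scan-and-select job: the administrator mostly needs to tell a five-minute screener from a forty-minute inventory.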

system design as ux

the build pipeline does a lot of the user experience work before the browser opens. edge-tts generates the Microsoft neural voice mp3s, and the script caches that audio by default: if an mp3 already exists, synthesis is skipped unless --force is passed.

that turns iteration from “wait for everything again” into “change the text, rebuild only what changed.” for a static tool, this is huge: it keeps content editing cheap enough that someone can correct an item, split a long battery into separate forms, or try a more intelligible voice without treating every rebuild like a deployment.
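the caching rule described here is small enough to sketch. this is a minimal version under assumptions: the zero-padded file naming is hypothetical, and synthesis is passed in as a callable so the example runs without network access. in the real tool that callable would presumably wrap edge-tts.

```python
from pathlib import Path
from typing import Callable


def build_audio(
    items: list[str],
    out_dir: Path,
    synthesize: Callable[[str, Path], None],
    force: bool = False,
) -> list[Path]:
    """Synthesize one mp3 per item; return only the files actually written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(items, start=1):
        target = out_dir / f"{number:03d}.mp3"  # hypothetical naming scheme
        if target.exists() and not force:
            continue                             # cache hit: keep the existing audio
        synthesize(text, target)
        written.append(target)
    return written
```

one caveat with a pure existence check: if an edited item keeps the same file name, the old audio is reused until the file is deleted or the build is forced. naming files by a hash of the text would be one way around that, at the cost of less readable output.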

the command surface also reflects the project shape:

uv run pai_tts.py --build-all
uv run pai_tts.py items.txt
uv run pai_tts.py --serve-only
uv run pai_tts.py --build-all --force --voice en-US-GuyNeural

there is a clean path for the common case, a single-form path for testing, a serve-only mode for review, and an override for voice or cache invalidation. the project stays close to the reality of paperwork: many instruments, repetitive content, lots of small corrections, and a need for repeatable presentation.
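for readers who want to see how a command surface like this maps onto code, here is a sketch using argparse. the flag names come from the commands above; the help strings, the function name make_parser, and the default voice are assumptions, since the script's actual default is not stated here.

```python
import argparse

ASSUMED_DEFAULT_VOICE = "en-US-AriaNeural"  # placeholder, not the tool's known default


def make_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pai_tts.py")
    parser.add_argument("input", nargs="?",
                        help="a single questionnaire .txt file")
    parser.add_argument("--build-all", action="store_true",
                        help="build every questionnaire")
    parser.add_argument("--serve-only", action="store_true",
                        help="serve existing output without rebuilding")
    parser.add_argument("--force", action="store_true",
                        help="regenerate audio even if cached mp3s exist")
    parser.add_argument("--voice", default=ASSUMED_DEFAULT_VOICE,
                        help="edge-tts voice name")
    return parser
```

keeping the input positional and optional is what lets the single-form path and the --build-all path share one entry point.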

what makes it feel finished

small affordances carry the experience:

audio completion tracking makes navigation less error-prone
forward navigation is locked until the current item finishes playing
the hidden Q shortcut gives administrators a controlled override when needed
stable item numbering helps users recover their place
cached synthesis makes the builder feel responsive after the first run
static output means the player can be served locally without accounts, remote hosting, or a backend
the included Windows executable lowers setup friction for people who should not have to care about Python packaging

none of those are flashy, but together they move the repo from “script that makes audio” to “tool for protecting questionnaire administration from a predictable access problem.” the point is not that text-to-speech magically fixes literacy, disability, language, or comprehension. it is that standardized audio can make the reading demand visible and negotiable instead of silently baked into the score.

next pass

if i kept pushing this as a UX case study, i would focus on evaluation-room safeguards: clearer focus states, a visible shortcut reference, per-question completion states, optional pause points, and a simple administration log showing which items were played or replayed. i would also want guidance around when audio administration is appropriate for a given instrument, because standardization cuts both ways. the core idea is already there: if a test depends on understanding the item, the interface should help make that understanding more likely.