⚾️ First Notes
Right off the bat, Music Maketh’s first month has been full of progress and excitement. By pairing deep study of academic research with development of our first iOS application, we’ve refined and expanded our theses and shipped. We hope you’ll enjoy the show! 🎸
🧧 A Brief Reading of Research
Our second newsletter is a concise meta-analysis of several dozen high-quality published research papers relevant to the art and science of musical intelligence, and it demonstrates Music Maketh’s directional alignment with today’s peer-reviewed knowledge.
Ultimately, a clear collective conclusion emerges: Music is a controllable laboratory for predictive curiosity(18,46,45), which makes it an unusually clean substrate for alignment(18,46,43). Humans (and good music systems) optimize for learnable structure unfolding over time(18,46,45,28,29)–not raw novelty, not pure predictability(46,45), nor single, isolated proxies like “text match” or “audio fidelity”(43,15,16), nor crystallized formalisms like algebra or law.
This learning and looping structure appears at three scales: in brains as prediction error and recalibration(18,28,29,30), in artificial intelligences as intrinsic reward for prediction-based exploration(46,45), and in generative music models as preference optimization(43,44).
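To make the “learnable structure, not novelty, not predictability” point concrete, here is a minimal Python sketch. It is entirely our own toy construction (the bigram pitch predictor and the three test sequences are hypothetical, not code from the cited papers): intrinsic reward is framed, in the spirit of (45,46), as the drop in prediction error while listening, which is highest for a learnable motif, near zero for raw noise, and exactly zero for a drone.

```python
import numpy as np

class BigramPitchPredictor:
    """Toy next-pitch model: predicts whichever pitch most often followed the
    previous one so far. A stand-in for the learned forward models in (45,46)."""
    def __init__(self):
        self.counts = {}   # previous pitch -> {next pitch: count}
        self.prev = None
    def predict(self):
        if self.prev is None:
            return 60                      # default guess: middle C
        if self.prev not in self.counts:
            return self.prev               # fall back to simple repetition
        successors = self.counts[self.prev]
        return max(successors, key=successors.get)
    def update(self, observed):
        if self.prev is not None:
            self.counts.setdefault(self.prev, {}).setdefault(observed, 0)
            self.counts[self.prev][observed] += 1
        self.prev = observed

def learning_progress(pitches):
    """Intrinsic reward ~ how much prediction error drops across the piece:
    noise never improves, a drone leaves nothing to learn, a motif rewards listening."""
    model, errors = BigramPitchPredictor(), []
    for p in pitches:
        errors.append(abs(model.predict() - p))
        model.update(p)
    half = len(errors) // 2
    return float(np.mean(errors[:half]) - np.mean(errors[half:]))

motif = [60, 62, 64, 65] * 8                                  # learnable structure
noise = list(np.random.default_rng(0).integers(40, 90, 32))   # raw novelty
drone = [60] * 32                                             # pure predictability
# The motif typically scores highest of the three; the drone scores exactly zero.
print(learning_progress(motif), learning_progress(noise), learning_progress(drone))
```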
Rather than rules, developmental alignment loops connect subjective human feedback to intelligences’ objective reward functions(46,45,43,44). Music Maketh’s products will be control surfaces for these loops(37,38,26,43), advancing today’s models, which merely generate plausible output, toward models that aim at diverse human values(43,36,38).
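As a heavily simplified sketch of one turn of such a loop (a linear reward model, synthetic clip features, and a scripted “listener”; this is not MusicRL’s actual pipeline(43), just the Bradley-Terry preference update that this family of methods builds on):

```python
import numpy as np

def bradley_terry_update(w, feats_preferred, feats_other, lr=0.05):
    """One gradient step on a linear reward model from a single human comparison.
    Under a Bradley-Terry model, P(preferred beats other) = sigmoid of the reward margin."""
    margin = w @ (feats_preferred - feats_other)
    p = 1.0 / (1.0 + np.exp(-margin))
    # Gradient of the log-likelihood: push the preferred clip's reward up, the other's down.
    return w + lr * (1.0 - p) * (feats_preferred - feats_other)

# The loop: generate two clips, ask which better fit the listener's mood,
# and fold the answer back into the reward the generator is later tuned against.
rng = np.random.default_rng(0)
w = np.zeros(8)                      # reward weights over 8 hypothetical clip features
for _ in range(200):
    a, b = rng.normal(size=8), rng.normal(size=8)
    preferred, other = (a, b) if a[0] > b[0] else (b, a)   # scripted stand-in for the human's pick
    w = bradley_terry_update(w, preferred, other)
print(w.round(2))   # the weight on feature 0 typically dominates, as the scripted listener "wanted"
```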
Music is not “sound” or “notes.” Music is a temporally structured signal(18,28,29,30) whose meaning lives across predictive coding(18,45), learning(45,46), repetition, variation(1,2,3,11), timing(28,29,30), behavior(49,50), culture(11), and history(10,11). Architecturally, the field keeps rediscovering the same truth: symbolic representations buy you structure and editability(1,2,3,5,6,14), raw audio buys you timbre/voice/production(4,31,32), and hybrid signal priors buy you controllability without losing realism(33,34).
Interestingly, once we ground alignment’s formal objective in subjective human experience(46,43), representation becomes a downstream choice and everything else–control, evaluation, personalization–becomes easier(15,16,43,44). In other words, manually crafting the perfect words(15,16,43) to pull a precise response from an intelligence is harder(6,4,32) than abstractly providing your mood, music(2,5,33), and/or feelings(6,2,5,33/34).
Music is a uniquely open and shared embedding space(15,16,41,40) where audio, video, motion, and language(15,16,41,40) can meet in symbiosis. Embeddings aren’t just nice-to-have retrieval tools; they’re the API layer between messy modalities: audio↔text(15,16), audio↔motion(40), video↔music(41), and even brain↔music reconstruction via latent priors(27). Common architectures like CNNs and LLMs learn musical representations very effectively(12,14,15,16) during training (albeit often requiring robust preprocessing pipelines(12,13,14,16)). Harmony and rhythm are clear examples of characteristics of music data(18,28,30) that stretch across musicology, neurology, and computational theory(18,28,29,30,46). Embodiment, say “dance_waltz:song_in3” or “deceptive_cadence:anticipation”, accurately expands the dimensionality of these representations to span, at minimum, the entire human neurosensory suite(40,29,49,50).
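A sketch of why “embeddings as the API layer” matters operationally: once two encoders share a space, as in MuLan/CLAP-style training(15,16), cross-modal retrieval collapses to a cosine similarity and a sort. The encoders themselves are out of scope here, so the vectors below are random stand-ins rather than real embeddings.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity, the workhorse of joint-embedding retrieval."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def rank_tracks(query_embedding, track_embeddings):
    """Rank catalogue tracks against any query that lives in the same shared space:
    a text prompt, another song, a video clip, even a decoded neural signal (27)."""
    scores = {tid: cosine(query_embedding, emb) for tid, emb in track_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)

rng = np.random.default_rng(7)
catalogue = {f"track_{i}": rng.normal(size=128) for i in range(5)}   # stand-in 128-d audio embeddings
text_query = rng.normal(size=128)                                    # stand-in text embedding
print(rank_tracks(text_query, catalogue))
```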
In co-creating with musical intelligence, representation has to be playable(36,37,38). Co-creation isn’t judged by ‘best composition,’ but by shared timing, legibility, and trust: the human must be able to predict what the model will do next(37,38), and the model must be able to adapt to spontaneous changes without breaking form(38). GenJam, Pachet’s Continuator, MaxMuse, Jukebox, and ReaLJam(36,37,26,4,38) each demonstrates verifiable “musical meaningfulness” that enables step-wise generation without destructive mutations to compositional integrity(36,37,38,39). Ear-training and music therapy can provide even more representations(47,49,50,23), from both subjective and objective sources, of performance errors, expressivity, timing, cadence, and outcomes that further help formalize reward functions, training benchmarks, and creative outcomes.
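A toy illustration of “playable representation” (our own construction, not the logic of GenJam, the Continuator, or ReaLJam): the reply is a legible transform of the human’s phrase, a sudden key change is absorbed as a state update, and the bar length is never violated.

```python
from dataclasses import dataclass

MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]   # scale degrees as semitone offsets from the key root

@dataclass
class JamState:
    key_root: int    # MIDI pitch the human is currently treating as "home"
    bar_beats: int   # form constraint: reply in exactly this many beats

def respond(human_phrase, state: JamState):
    """Echo the phrase back an octave up, folded into the current key and trimmed to the bar.
    Legibility: the reply is a recognizable transform of the input.
    Adaptability: a key change only updates `state`; the form survives. Toy logic only."""
    def snap(pitch):   # pull a pitch onto the nearest scale tone of the current key
        pc = (pitch - state.key_root) % 12
        nearest = min(MAJOR_SCALE, key=lambda s: min(abs(s - pc), 12 - abs(s - pc)))
        return pitch - pc + nearest
    reply = [snap(p + 12) for p in human_phrase]
    return (reply + reply)[: state.bar_beats]     # never overrun the bar

state = JamState(key_root=60, bar_beats=4)
print(respond([60, 62, 63, 67], state))   # the out-of-key 63 is snapped onto a C-major tone
state.key_root = 67                       # the human pivots toward G mid-jam...
print(respond([67, 70, 71, 74], state))   # ...and the loop follows without breaking the bar
```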
Millions of years have evolved our minds around, from, and for our sensations(18,46). Anticipation and arousal(20,18) are central to our minds’ predictive coding functionality and self-relevant meaning(21). Musical stimuli are impressed deep within neuronal and physiological structures(28,29,30,20,49,50), such as oscillatory(28,30), pleasure(20), and motor(49,50) processes—we don’t just hear; we change in the gestalt(18,49,50). Neural signals don’t map to “notes”(27,16,15); they map to shared latent spaces(Kernel Sound ID,27). Representation is not “processing.” It’s the interface between biology and generation(27,43), and embeddings are the lingua franca of that interface.
Metrics aren’t music(6,12,Cífka), so evaluation must include humans, context, and tasks(44,43,38). When a metric becomes a target, it can stop being a measure—symbolic self-similarity scores can be matched by musically meaningless artifacts, which is why the gold standard stays human + task + context(6,44,43). “Musical rituals”–embodied, multi-dimensional, variable, continuous, connected–in the form of live and recorded performances provide far higher-quality signal(11,18,49,36,37,38,44) versus, say, individual parameter sliders. Music is a privileged domain for building aligned feedback loops(18,21,28,29,30,49,50,43,45) because it’s structured, safe, embodied, memorable(21), and naturally preference-shaped(43,45) relative to the human brain.
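To see the failure mode in miniature, here is a deliberately gameable toy metric (our own construction, not any published score): a “self-similarity” number that a musically meaningless drone maxes out while a structured-but-varied phrase scores near zero.

```python
def ngram_self_similarity(pitches, n=4):
    """Toy symbolic metric: the fraction of n-grams that occur more than once."""
    grams = [tuple(pitches[i:i + n]) for i in range(len(pitches) - n + 1)]
    repeated = sum(1 for g in grams if grams.count(g) > 1)
    return repeated / max(len(grams), 1)

melody = [60, 62, 64, 65, 67, 69, 71, 72,     # structured but varied phrase
          72, 71, 69, 67, 65, 64, 62, 60,
          60, 64, 67, 72, 72, 67, 64, 60]
drone = [60] * 32                             # musically meaningless, metric-maximal
print(ngram_self_similarity(melody), ngram_self_similarity(drone))   # the drone "wins"
```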
Across domains, historical and modern research illustrates that music is among the richest forms of human data.
Its computational formalization–to extrapolate, decompose, symbolize, leverage, etc.(10,1,2,3,4,6,32,33/34,43)–has proven possible, and is theoretically reliable and powerful for constructing aligned and accessible artificial intelligences.
📲 The Moodlist iOS App
Moodlist, “generate the perfect playlist for your mood”, is our first software product and will be free on iOS. The beta is available now for select newsletter subscribers and launches on the App Store next week.

Seed the generation with inspiration: musical selections from your music library or Apple Music’s complete catalogue, a text prompt, and photos 🎶
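Purely as a conceptual sketch of how multi-modal seeding can work (this is not Moodlist’s actual implementation, and every vector below is a hypothetical stand-in): if each seed, whether a chosen track, a text prompt, or a photo, can be embedded into one shared space like the one discussed earlier, then the playlist query is a weighted blend of the seed vectors, ranked against the catalogue by cosine similarity.

```python
import numpy as np

def fuse_seeds(seed_embeddings, weights=None):
    """Blend seed embeddings (e.g. song, text prompt, photo) into one unit-norm query vector.
    Assumes all seeds already live in a shared embedding space; conceptual only."""
    E = np.stack(seed_embeddings)
    w = np.ones(len(E)) if weights is None else np.asarray(weights, dtype=float)
    query = (w[:, None] * E).sum(axis=0) / w.sum()
    return query / (np.linalg.norm(query) + 1e-9)

rng = np.random.default_rng(3)
song, prompt, photo = (rng.normal(size=128) for _ in range(3))       # stand-in seed embeddings
catalogue = {f"track_{i}": rng.normal(size=128) for i in range(6)}   # stand-in audio embeddings
query = fuse_seeds([song, prompt, photo], weights=[2.0, 1.0, 1.0])   # lean hardest on the chosen song
ranked = sorted(catalogue,
                key=lambda t: float(query @ catalogue[t]) / (np.linalg.norm(catalogue[t]) + 1e-9),
                reverse=True)
print(ranked[:3])   # the playlist's opening tracks, under these toy vectors
```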
📕 References
(11) Mauch, M., MacCallum, R. M., Levy, M., & Leroi, A. M. (2015). The evolution of popular music: USA 1960–2010.
(18) Vuust, P., et al. (2022). Predictive coding in music cognition.
(45) Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction.
(46) Schmidhuber, J. (2010). Formal theory of creativity, fun, and intrinsic motivation.
(49) Magee, W. L., et al. (2017). Neurologic music therapy in acquired brain injury: systematic review/meta-analysis.
(10) Xenakis, I. (1971). Formalized Music: Thought and Mathematics in Composition.
(1) Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.
(5) Dong, H.-W., Hsiao, W.-Y., Yang, L.-C., & Yang, Y.-H. (2018). MuseGAN: Multi-track sequential GANs for symbolic music generation and accompaniment.
(2) Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Shazeer, N., et al. (2018). Music Transformer: Generating music with long-term structure.
(3) Payne, C. (2019). MuseNet (OpenAI research report).
(4) Dhariwal, P., Jun, H., Payne, C., et al. (2020). Jukebox: A generative model for music.
(31) Engel, J., Resnick, C., Roberts, A., et al. (2017). Neural audio synthesis of musical notes with WaveNet autoencoders (NSynth).
(33/34) Engel, J., Hantrakul, L., Gu, C., & Roberts, A. (2020). DDSP: Differentiable digital signal processing.
(32) Kong, Z., Ping, W., Huang, J., Zhao, K., & Catanzaro, B. (2020). DiffWave: A versatile diffusion model for audio synthesis.
(35) Chandrasekaran, A., et al. (2021). Machine-learned acoustic design.
(6) Atassi, L. (2023). Generating symbolic music using diffusion models.
(6) Plasser, M., et al. (2023). Discrete diffusion probabilistic models for symbolic music generation (SCHmUBERT).
(Cífka) Cífka, O., et al. (2019). Supervised symbolic music style translation using synthetic data.
(12) Dieleman, S., & Schrauwen, B. (2014). End-to-end learning for music audio / CNNs for music classification.
(13) Grill, T., & Schlüter, J. (2015). Music boundary detection using neural networks on combined features and two-level annotations.
(14) MusicBERT (2021). Symbolic Music Understanding with Large-Scale Pre-Training.
(15) MuLan (2022). A joint embedding of music audio and natural language.
(16) CLAP (2023). Learning audio concepts from natural language supervision.
(40) Li, et al. Audio to Body Dynamics.
(41) Choi, et al. (2024). Video2Music: Suitable music generation from videos using an affective multimodal transformer model.
(20) Salimpoor, V. N., et al. (2011). Dopamine release during anticipation and experience of peak emotion to music.
(28) Large, E. W., & Jones, M. R. (1999). Dynamic Attending Theory.
(29) Fujioka, T., et al. (2012). Internalized timing represented in neuromagnetic beta oscillations.
(30) Nozaradan, S., et al. (2011/2016). Neural entrainment to beat and meter (frequency tagging / meter markers).
(21) Janata (2009). Music-evoked autobiographical memory and DMN framing.
(23) Ramirez et al. EEG emotion-adaptive / emotion task performance with music.
(17) Zuk et al. (2024). Music-selective units in neural nets.
(24) Miranda et al. (2005). Brain–Computer Music Interfacing / Toward Direct BCMI.
(26) MaxMuse (2023). Brain signals for real-time musical applications.
(27) Ciferri et al. (2025). Reconstructing music perception from brain activity using a prior guided diffusion model.
(25) BrainiBeats (CHI 2023). Dual-user neural synchrony → music.
(36) Biles, J. A. (1994). GenJam: A genetic algorithm for generating jazz solos.
(37) Pachet, F. (2003). The Continuator: Musical interaction with style.
(38) ReaLJam (2025). Real-time human–AI music jamming with RL-tuned transformers.
(39) Oore et al. (2024). SmartLooper.
(Metacreation) Evaluating musical metacreation in a live performance context.
(44) Eigenfeldt, A., & Pasquier, P. (2011). Audience preference modeling / evaluation of metacreation.
(43) Cideron, G., et al. (2024). MusicRL: Aligning music generation to human preferences.
(47) D’Ignazio et al. (2023). AI-based instrument tutoring (real-time adaptive feedback).
(48) TELMI Project (2020+). Technology Enhanced Learning of Musical Instruments.
(50) Thaut et al. Rhythmic Auditory Stimulation (RAS) for gait / Parkinson’s falls study and related work.

