A new generative neural network called Riffusion has appeared online. It generates music from text prompts and is based on Stable Diffusion version 1.5.

The idea is that Stable Diffusion generates spectrograms (also called sonograms), which are visual representations of sound. A spectrogram is an ordinary flat image in which the x-axis represents time, running from left to right, and the y-axis represents the frequency of the sound. The color of each pixel encodes the amplitude of that frequency at that moment.
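
To make the axes concrete, here is a minimal sketch of how a spectrogram can be computed from a sound. It is not Riffusion's code: it uses a plain discrete Fourier transform from the Python standard library, and the function name `magnitude_spectrogram`, the sample rate, and the frame sizes are all assumptions chosen for illustration.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size, hop):
    """Return a grid indexed [frequency_bin][frame] of DFT magnitudes.

    Hypothetical helper for illustration only (naive DFT, no windowing).
    """
    n_bins = frame_size // 2 + 1
    columns = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        column = []
        for k in range(n_bins):
            # Correlate the frame with a complex sinusoid at bin k.
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(frame)
            )
            column.append(abs(acc))
        columns.append(column)
    # Transpose so rows are frequencies (y-axis) and columns are time (x-axis).
    return [list(row) for row in zip(*columns)]


SAMPLE_RATE = 8000  # assumed value, in Hz
FRAME_SIZE = 64     # assumed value, in samples
TONE_HZ = 1000      # falls exactly on bin 1000 * 64 / 8000 = 8

signal = [math.sin(2 * math.pi * TONE_HZ * t / SAMPLE_RATE) for t in range(512)]
spec = magnitude_spectrogram(signal, FRAME_SIZE, hop=32)

# The brightest "pixel" in the first column marks the tone's frequency.
loudest_bin = max(range(len(spec)), key=lambda k: spec[k][0])
print(loudest_bin, loudest_bin * SAMPLE_RATE / FRAME_SIZE)
```

For a steady 1000 Hz tone, every column has its largest value in the same row, so the image would show a single horizontal line. Real music produces many such lines that appear, fade, and shift over time.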

The working principle is simple: Stable Diffusion generates a spectrogram image, and the image data is then converted into sound using the Torchaudio audio processing library. The result is a music track. The text prompt can specify a genre such as rock or jazz. You can even generate the sound of typing on a keyboard.
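
The image-to-sound step can be illustrated with a toy example. The article only says that Torchaudio does this conversion in Riffusion, so the sketch below deliberately swaps in a much simpler technique, additive synthesis with the Python standard library: each row of the grid drives a sine wave at that row's frequency, and the cell value sets its loudness. The function name `synthesize`, the grid contents, and all numeric parameters are assumptions for illustration.

```python
import io
import math
import struct
import wave


def synthesize(spectrogram, sample_rate, frame_size, hop):
    """Turn a [frequency_bin][frame] amplitude grid into audio samples.

    Hypothetical helper: simple additive synthesis, not Riffusion's method.
    """
    n_frames = len(spectrogram[0])
    samples = []
    for i in range(n_frames * hop):
        frame = i // hop  # which image column this sample belongs to
        value = 0.0
        for k, row in enumerate(spectrogram):
            amp = row[frame]
            if amp:
                freq = k * sample_rate / frame_size
                value += amp * math.sin(2 * math.pi * freq * i / sample_rate)
        samples.append(value)
    # Normalize to the range [-1, 1]; fall back to 1.0 for a silent grid.
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]


SAMPLE_RATE = 8000  # assumed value, in Hz
FRAME_SIZE = 64     # assumed value; bin k corresponds to k * 125 Hz
HOP = 32            # assumed number of audio samples per image column

# A tiny 33 x 4 "image": 500 Hz (bin 4) for two columns, then 1000 Hz (bin 8).
grid = [[0.0] * 4 for _ in range(FRAME_SIZE // 2 + 1)]
grid[4][0] = grid[4][1] = 1.0
grid[8][2] = grid[8][3] = 1.0

audio = synthesize(grid, SAMPLE_RATE, FRAME_SIZE, HOP)

# Pack as 16-bit mono PCM. Pass a filename instead of the buffer to save a file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(SAMPLE_RATE)
    wav.writeframes(struct.pack("<%dh" % len(audio), *[int(s * 32767) for s in audio]))
wav_bytes = buffer.getvalue()
```

A spectrogram image stores only loudness, not the phase of each frequency, so any converter has to choose or estimate the phases. The sketch simply starts every sine wave at zero phase, which is enough to hear the idea but far cruder than a dedicated audio library.
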
You can try Riffusion yourself here.
Source: VG Times