As part of the Computer Vision course, our group developed LangueViz, a system for detecting spoken language from audio using mel spectrograms and a DenseNet-based CNN. The project explored whether language classification could be approached as a visual recognition task by transforming short speech recordings into spectrogram images and applying transfer learning.
Overview
The goal of the project was to classify short audio clips by spoken language using computer vision methods rather than conventional speech-processing pipelines. The central idea was that spectrograms contain enough structural information for a CNN to distinguish between languages visually. As a proof of concept, the project focused primarily on German and English, while also including other languages and non-language audio as additional classes to make the setting more realistic.
The final system, LangueViz, converted audio into fixed-size mel spectrograms and used a pretrained DenseNet as the main classifier. Rather than relying on handcrafted audio features alone, we combined image-based representations with transfer learning.
Method

The training pipeline first sampled short audio clips, converted them into mel spectrograms, and fed them into a pretrained DenseNet for supervised classification. The input pipeline used clips of fixed duration and limited the frequency range to the bandwidth most relevant for human speech. For the training setup, we combined language data from Common Language and Common Voice with non-language samples from UrbanSound8K.
Within the project, I worked extensively on the training and evaluation pipeline, including model design choices, experiment tracking, and hyperparameter optimization. In particular, I used Weights & Biases for sweeps, configuration tracking, and comparative analysis across training runs. This made it possible to evaluate more systematically how factors such as clip length, layer freezing, and the proportion of noise and out-of-distribution language samples influenced performance.
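A Weights & Biases sweep over these factors could be configured roughly as follows. The parameter names, ranges, and search method are illustrative assumptions, not the project's exact search space:

```python
# Hypothetical W&B sweep configuration covering the factors discussed above.
sweep_config = {
    "method": "bayes",
    "metric": {"name": "val_accuracy", "goal": "maximize"},
    "parameters": {
        "clip_seconds": {"values": [3, 5, 10]},
        "freeze_backbone": {"values": [True, False]},
        "noise_fraction": {"min": 0.0, "max": 0.3},
        "other_language_fraction": {"min": 0.0, "max": 0.3},
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-3,
        },
    },
}

# Launched with:
#   sweep_id = wandb.sweep(sweep_config, project="langueviz")
#   wandb.agent(sweep_id, function=train)
# where train() reads wandb.config and logs "val_accuracy" each epoch.
```

Each agent run then pulls one configuration from the search space, which is what enables the comparative analysis across runs described above.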

The sweep results showed several clear trends: longer audio clips improved performance, fine-tuning the full network worked better than updating only the last layers, and adding a limited amount of noise or other-language samples improved robustness, although too many unsupported-language samples had a negative effect.
Challenges
One of the main technical challenges was separating spoken-language content from noise and unsupported inputs. To address this, we introduced a two-head setup: one head for distinguishing language from non-language, and a second head for predicting the actual language class. A second challenge was the limited size of the original Common Language dataset, which led to overfitting and motivated the use of the larger Common Voice corpus.
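Structurally, the two-head setup amounts to a shared backbone feeding two output layers. The sketch below is a minimal illustration with a stand-in backbone and assumed layer sizes; the project used DenseNet features instead:

```python
import torch
import torch.nn as nn

class TwoHeadClassifier(nn.Module):
    """Shared feature extractor with two heads: one deciding whether the
    input is spoken language at all, one predicting which language."""

    def __init__(self, feature_dim: int = 1024, num_languages: int = 2):
        super().__init__()
        # Stand-in backbone for illustration; the project used DenseNet.
        self.backbone = nn.Sequential(nn.LazyLinear(feature_dim), nn.ReLU())
        self.is_language_head = nn.Linear(feature_dim, 2)  # speech vs. not
        self.language_head = nn.Linear(feature_dim, num_languages)

    def forward(self, x: torch.Tensor):
        features = self.backbone(x.flatten(1))
        return self.is_language_head(features), self.language_head(features)

model = TwoHeadClassifier()
gate_logits, lang_logits = model(torch.randn(8, 4096))
# Training would typically combine one cross-entropy loss per head,
# e.g. loss = ce(gate_logits, is_speech) + ce(lang_logits, language).
```

Splitting the decision this way lets the language head specialize on genuine speech while the gate head absorbs the noise and unsupported inputs.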
Handling the scale of Common Voice also introduced practical infrastructure problems. We initially planned to use a streaming setup, but due to memory limitations this was not feasible in practice. These issues made the project as much an exercise in experimental design and data handling as in model training itself.
Results
The trained models achieved high validation accuracy, and transfer learning was competitive with training from scratch. Qualitative testing, however, revealed a more important limitation: the model adapted poorly to real-world inputs. In practice, the system appeared to rely on dataset-specific characteristics rather than learning language identity robustly enough for broader deployment.

This outcome was one of the most important findings of the project. Rather than simply treating the validation metrics as a success signal, we concluded that the model was likely learning underlying dataset patterns, such as recording quality or microphone characteristics, instead of the spoken language itself. That made the project especially valuable as a practical lesson in distribution shift and dataset bias.
Conclusion
LangueViz was an instructive project because it explored how far an image-based approach can be pushed on a problem that is usually framed as speech processing. It combined transfer learning, dataset engineering, hyperparameter optimization, and structured evaluation in a compact but technically meaningful pipeline.
Equally important, the project highlighted a realistic research outcome, in that a model can perform well on validation data while still failing to generalize for the right reasons. That made the work particularly valuable as an exercise in critical evaluation, experimental rigor, and understanding the gap between measured performance and actual robustness. The implementation and weights are available on GitHub and all the runs were tracked on Weights & Biases.