This project presents Bar-JEPA, a method for extracting numerical values from bar charts using a joint-embedding predictive architecture. The work focuses on chart understanding when labeled data is scarce and investigates whether self-supervised pretraining can yield more useful visual representations for chart de-rendering. A short summary is provided here, while the full paper is available in the Papers section.

Overview
Bar charts are widely used in scientific and technical communication, but recovering the underlying numerical data automatically remains a difficult problem. In machine learning settings, this is further complicated by the limited availability of annotated real-world chart data. The central idea of this project was therefore to improve chart value extraction by using self-supervised feature learning instead of relying purely on end-to-end supervised training.
The resulting method, Bar-JEPA, uses a finetuned I-JEPA encoder to produce semantically rich latent features for bar charts. These features are then consumed by a lightweight decoder that predicts the positions of bars, ticks, and the coordinate system origin, from which per-bar numerical values can be recovered. In addition to the extraction pipeline itself, the project also introduced variable-resolution input support for I-JEPA and a synthetic dataset generator for large-scale bar chart training.

Method
The pipeline begins with a bar chart image that is processed by a modified I-JEPA encoder. Since chart aspect ratios can vary considerably, I extended the architecture to support variable-resolution inputs while preserving the original chart layout more faithfully than fixed-resolution resizing. The encoder itself was finetuned in a self-supervised manner on a large corpus of synthetic bar charts in order to adapt its latent representations to chart structure and geometry.
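To make the variable-resolution idea concrete, one can think of it as padding each chart up to the nearest patch multiple and emitting a variable-length token sequence, rather than squashing every image to a fixed square. The sketch below is my own simplification in NumPy; `patchify` is a hypothetical helper and not the project's actual code:

```python
import numpy as np

def patchify(img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an image of shape (H, W, C) into flattened patch tokens.

    H and W are padded up to the next multiple of `patch`, so charts of
    arbitrary aspect ratio yield sequences of different lengths instead
    of being resized to a fixed resolution.
    """
    h, w, c = img.shape
    pad_h = (-h) % patch
    pad_w = (-w) % patch
    img = np.pad(img, ((0, pad_h), (0, pad_w), (0, 0)))
    gh, gw = img.shape[0] // patch, img.shape[1] // patch
    # Rearrange into (grid_h, grid_w, patch, patch, C), then flatten.
    patches = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch * patch * c)

# A wide 300x500 chart produces a 19x32 grid of 608 tokens.
tokens = patchify(np.zeros((300, 500, 3)))
```

The encoder then attends over this variable-length sequence, which is what lets wide or tall charts keep their layout rather than being distorted by resizing.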

To train the pipeline, I generated a synthetic dataset of 100,000 bar charts for encoder finetuning and an additional labeled dataset for decoder training. The generated charts were intentionally diverse in layout, typography, colors, and legend placement, while remaining restricted to vertical bar charts in order to keep the problem well defined. Each chart was accompanied by precise annotations for bars, ticks, and other relevant chart elements.
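As an illustration of the annotation format, the following sketch samples a random vertical bar chart specification and derives pixel-space targets for bars, ticks, and the axis origin. The function name and layout parameters are assumptions for this example; the real generator additionally renders images and randomizes typography, colors, and legend placement:

```python
import random

def make_chart_spec(width=640, height=480, margin=60, seed=None):
    """Sample a random vertical bar chart and its pixel-space annotations.

    Returns the bar values together with the targets a decoder could be
    trained on: per-bar bounding boxes, tick positions, and the origin.
    """
    rng = random.Random(seed)
    n = rng.randint(3, 8)
    values = [rng.uniform(1.0, 100.0) for _ in range(n)]
    vmax = max(values)
    origin = (margin, height - margin)            # pixel coords of (0, 0)
    plot_w, plot_h = width - 2 * margin, height - 2 * margin
    slot = plot_w / n
    bars = []
    for i, v in enumerate(values):
        left = margin + i * slot + 0.15 * slot    # 15% gap on each side
        right = margin + (i + 1) * slot - 0.15 * slot
        top = origin[1] - (v / vmax) * plot_h
        bars.append((left, top, right, origin[1]))
    # Five evenly spaced y-axis ticks: (x_pixel, y_pixel, label_value).
    ticks = [(margin, origin[1] - t / 4 * plot_h, t / 4 * vmax)
             for t in range(5)]
    return {"values": values, "origin": origin, "bars": bars, "ticks": ticks}
```

Pairing every chart with exact geometry like this is what makes large-scale supervised decoder training possible without any manual labeling.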

The decoder itself is intentionally lightweight. It upsamples the frozen encoder features and predicts heatmaps for bars, ticks, and the chart origin. These predictions are post-processed to recover candidate coordinates, which are refined through non-maximum suppression and matched with OCR-extracted tick labels. Finally, a regression step maps the detected bar positions to numerical values, producing the recovered chart data.
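The post-processing can be sketched in two steps: greedy non-maximum suppression over a predicted heatmap, and a linear fit from the OCR'd tick (pixel, label) pairs that converts detected bar-top pixels into values. Both helpers below are hypothetical simplifications, not the project's code:

```python
import numpy as np

def heatmap_peaks(hm, thresh=0.5, radius=2):
    """Greedy NMS: take the strongest response, zero out a small
    neighborhood around it, and repeat until below threshold."""
    hm = hm.copy()
    peaks = []
    while hm.max() > thresh:
        y, x = np.unravel_index(hm.argmax(), hm.shape)
        peaks.append((int(y), int(x)))
        hm[max(0, y - radius): y + radius + 1,
           max(0, x - radius): x + radius + 1] = 0.0
    return peaks

def pixels_to_values(bar_tops, tick_ys, tick_labels):
    """Fit value = a * y + b from tick pixel rows and their OCR'd labels,
    then map detected bar-top pixel rows to numeric values."""
    a, b = np.polyfit(tick_ys, tick_labels, 1)
    return [a * y + b for y in bar_tops]

# Ticks at y=100 (value 0) and y=50 (value 10); a bar top at y=75 maps to 5.
values = pixels_to_values([75.0], [100.0, 50.0], [0.0, 10.0])
```

Because the mapping from pixels to values is affine on a linear axis, two detected ticks are already enough to recover all bar values; additional ticks make the fit robust to OCR noise.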

Results
The evaluation showed that self-supervised finetuning of the encoder had a substantial impact on downstream performance. Compared to a baseline using the unmodified pretrained encoder, the finetuned model achieved markedly stronger bar and tick detection and was the only variant that recovered values reliably on real-world charts. The results also showed that variable-resolution inputs improved value recovery further, while the simpler decoder variant proved too limited for this task.

Although the method does not aim to be a full chart-to-table system, it performed competitively in the targeted task of per-bar value recovery. In particular, the results suggest that semantically stronger latent features can reduce the amount of supervised complexity needed in the downstream model. This was one of the central findings of the project and supports the broader idea that representation quality matters strongly for document and chart understanding tasks.

Conclusion
This project extended my work in computer vision and document understanding by focusing on chart de-rendering under limited labeled data. It combined self-supervised learning, synthetic data generation, structured prediction, and quantitative evaluation into a single pipeline. Beyond the method itself, the project was particularly valuable as an exploration of how representation learning can improve data efficiency in a vision task with strong geometric and semantic structure.
The work also identified several natural next steps, including stronger decoders, support for additional chart types, and potential integration with multimodal systems for broader chart understanding tasks. The full paper contains further implementation details, ablation studies, and quantitative comparisons. The code is available on GitHub, and all training runs are logged on Weights & Biases.