hmByT5 - Preliminary Language Models

Preliminary Historical Multilingual and Monolingual ByT5 Models. The following languages are currently covered:

English (British Library Corpus - Books)
German (Europeana Newspaper)
French (Europeana Newspaper)
Finnish (Europeana Newspaper)
Swedish (Europeana Newspaper)
Dutch (Delpher Corpus)

More details can be found in our GitHub repository.
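For readers unfamiliar with ByT5: these models work on raw UTF-8 bytes rather than on a learned subword vocabulary, so historical spellings and unusual characters never fall outside the vocabulary. The following is a minimal sketch of that byte-level mapping. It assumes the standard ByT5 convention of three reserved ids (pad, end-of-sequence, unknown) placed before the 256 byte values. The function names are hypothetical, and this is not the tokenizer implementation shipped with the models.

```python
# Sketch of ByT5-style byte-level encoding (assumed convention, see lead-in).
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2   # reserved ids assumed to precede the byte ids
BYTE_OFFSET = 3                    # byte value b is mapped to id b + 3


def encode_byt5_style(text, add_eos=True):
    """Map a string to ids: one id per UTF-8 byte, optionally followed by EOS."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode_byt5_style(ids):
    """Invert encode_byt5_style, skipping reserved ids and anything out of byte range."""
    data = bytes(i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256)
    return data.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # The long s (U+017F), common in historical prints, takes two bytes in UTF-8.
    ids = encode_byt5_style("Waſſer")
    print(ids)
    print(decode_byt5_style(ids))
```

A consequence of this design is that sequences get longer than with subword tokenizers, since every non-ASCII character costs two or more ids.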

Leaderboard

We evaluate our pretrained language models on various datasets from HIPE-2020, HIPE-2022 and Europeana. The following table gives an overview of the datasets used.

Language	Dataset	Additional Dataset
English	AjMC	-
German	AjMC	-
French	AjMC	ICDAR-Europeana
Finnish	NewsEye	-
Swedish	NewsEye	-
Dutch	ICDAR-Europeana	-

Current best models:

Model	English AjMC	German AjMC	French AjMC	Finnish NewsEye	Swedish NewsEye	Dutch ICDAR	French ICDAR
`hmbyt5/byt5-small-english`	85.65 ± 1.21	87.27 ± 0.50	84.44 ± 0.79
`hmbyt5-preliminary/byt5-small-english-german`	85.74 ± 0.72	87.45 ± 0.67	84.23 ± 0.65
`hmbyt5-preliminary/byt5-small-english-german-french`	85.61 ± 0.96	87.24 ± 0.76	84.39 ± 0.68
`hmbyt5-preliminary/byt5-small-english-german-french-finnish`	85.30 ± 1.14	87.37 ± 0.53	84.12 ± 0.42
`hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish`	85.40 ± 0.78	87.12 ± 0.19	84.41 ± 0.34
`hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish-dutch`	85.51 ± 0.68	87.58 ± 0.39	84.39 ± 0.83	55.46 ± 1.99	73.38 ± 2.45	84.80 ± 0.44	75.97 ± 0.55
`hmbyt5-preliminary/byt5-small-multilingual-4g`	83.49 ± 0.96	87.65 ± 0.63	84.16 ± 0.90
`hmbyt5-preliminary/byt5-small-multilingual-4g-2e`	83.86 ± 0.61	87.54 ± 0.19	84.29 ± 0.41
`hmbyt5-preliminary/byt5-small-multilingual-4g-3e`	83.49 ± 0.99	87.38 ± 0.53	84.30 ± 0.51
`hmbyt5-preliminary/byt5-small-historic-multilingual-flax`	83.28 ± 1.67	86.98 ± 0.71	83.49 ± 1.06	76.96 ± 1.58	78.80 ± 1.89	86.47 ± 0.79	77.43 ± 0.51
`hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax`	84.91 ± 0.86	88.02 ± 0.35	84.78 ± 0.75	77.77 ± 1.83	79.94 ± 0.60	86.85 ± 0.91	77.45 ± 0.54

More recent results on additional datasets can be found in the hmLeaderboard.
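To compare models programmatically, the result cells in the table above can be parsed and the best model per dataset selected. The sketch below uses a handful of rows copied from that table. It assumes that the first number in each cell is the score to compare, that the number after ± is a spread, and that higher is better. The helper names are hypothetical.

```python
import re

# Column order as in the results table above.
DATASETS = [
    "English AjMC", "German AjMC", "French AjMC", "Finnish NewsEye",
    "Swedish NewsEye", "Dutch ICDAR", "French ICDAR",
]

# A subset of rows copied from the table; rows with fewer cells simply
# have no results for the trailing datasets.
ROWS = {
    "hmbyt5-preliminary/byt5-small-english-german":
        "85.74 ± 0.72\t87.45 ± 0.67\t84.23 ± 0.65",
    "hmbyt5-preliminary/byt5-small-english-german-french-finnish-swedish-dutch":
        "85.51 ± 0.68\t87.58 ± 0.39\t84.39 ± 0.83\t55.46 ± 1.99\t73.38 ± 2.45\t84.80 ± 0.44\t75.97 ± 0.55",
    "hmbyt5-preliminary/byt5-small-multilingual-4g":
        "83.49 ± 0.96\t87.65 ± 0.63\t84.16 ± 0.90",
    "hmbyt5-preliminary/byt5-small-historic-multilingual-flax":
        "83.28 ± 1.67\t86.98 ± 0.71\t83.49 ± 1.06\t76.96 ± 1.58\t78.80 ± 1.89\t86.47 ± 0.79\t77.43 ± 0.51",
    "hmbyt5-preliminary/byt5-small-historic-multilingual-span20-flax":
        "84.91 ± 0.86\t88.02 ± 0.35\t84.78 ± 0.75\t77.77 ± 1.83\t79.94 ± 0.60\t86.85 ± 0.91\t77.45 ± 0.54",
}

CELL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*±\s*(\d+(?:\.\d+)?)\s*$")


def parse_cell(cell):
    """Return (score, spread) for a cell like '85.74 ± 0.72', or None if it does not match."""
    match = CELL.match(cell)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def best_per_dataset(rows, datasets):
    """Return {dataset: (model, score)} for the highest score found in each column."""
    best = {}
    for model, line in rows.items():
        # zip stops at the shorter sequence, so short rows skip trailing datasets.
        for dataset, cell in zip(datasets, line.split("\t")):
            parsed = parse_cell(cell)
            if parsed is None:
                continue
            score, _spread = parsed
            if dataset not in best or score > best[dataset][1]:
                best[dataset] = (model, score)
    return best


if __name__ == "__main__":
    for dataset, (model, score) in best_per_dataset(ROWS, DATASETS).items():
        print(f"{dataset}: {model} ({score:.2f})")
```

This comparison ignores the spread entirely. Where the gap between two models is smaller than their reported ± values, the ranking should be read with caution.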

Acknowledgements

We thank Luisa März, Katharina Schmid and Erion Çano for their fruitful discussions about Historical Language Models.

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). Many thanks for providing access to the TPUs ❤️