
Saturday, 4 August 2012

Rendering letters with umlauts etc - possibly more than you wanted to know about Latvian orthography

The edit log 

14 October 2019 - I've finally found out what this phenomenon is called: Mojibake. There's plenty of detail on that Wikipedia page; scroll down to the section on English and other Western European languages to see what we're up against. If I've understood correctly, it arises when text encoded in the UTF-8 character set is misinterpreted as characters from the ISO-8859-1 set. I've also added the Ã¶ after discovering "BjÃ¶rk".

5 November 2015 - Tom Morris has created an online tool that lets you type in gibberish, like RopaÅ¾u, and it will return the correct form, i.e. Ropažu.

15 November 2014 - now with added â€“, which codes for a dash (–).

13 September 2012 - new bit on Morse code


The actual post
Part of my job at the moment involves transferring the names and addresses of people who want to subscribe to our schools magazines (cs4fn, ee4fn and Audio4fn) onto an Excel spreadsheet / database. Generally this is a bit of cutting and pasting, making sure things are appropriately formatted and filling in gaps. There are plenty of addresses from India and the US and I'm learning quite a bit about cities and states in those countries, and of course zip codes.

There's recently been a crop of addresses coming in from Latvia and Hungary and I've been surprised that some of the words have been completely unreadable. I know that those languages have extra letters with diacritical marks on them (ümlåuts and things of that îlk) but those letters should show up fine on screens and 'come through' OK from the online form. What comes through is completely unreadable, but it seems to have a consistent format.

Each mystery letter is rendered by a capital letter A with a diacritic (eg Ä) and then another character («), forming the combination Ä« (which turns out to be ī, as in Līgatnes and not LÄ«gatnes).
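
A minimal sketch of why the pattern is so consistent, assuming (as the Mojibake explanation in the edit log suggests) that the form sent UTF-8 and something downstream read the bytes as a single-byte Western European encoding. Python's `cp1252` (Windows-1252) codec stands in for that encoding here; that choice is an assumption, not something confirmed about the form or the spreadsheet.

```python
# Sketch only: one Latvian letter becomes two "mystery" characters
# when its UTF-8 bytes are read one byte at a time.
letter = "ī"                      # the real letter, U+012B
raw = letter.encode("utf-8")      # UTF-8 stores it as two bytes
garbled = raw.decode("cp1252")    # assumed misreading: one character per byte

print(raw.hex(" "), "->", garbled)   # c4 ab -> Ä«
```

The first byte (c4) is what shows up as the capital A with a diacritic; the second byte picks out the particular letter, which is why the second character is the one that varies.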

At first I thought they'd just be vowels, and that someone whose city name might be, say, London might not mind desperately if I rendered it as Londan. Then I realised, after looking at Latvian orthography on Wikipedia and the letters of the Hungarian alphabet, that c, s, n, k and other letters are involved. You can get some wiggle room on vowels but less so on consonants - and the length of the word and percentage of correct letters will have an impact too.

Because of the consistency I assumed that there'd be a standard gloss available that told you which letters to replace the mystery characters with. After all, I can't be the first person to be faced with BÄ“rziÅ†Å¡ and not know immediately what to do with it (it turned out to be Bērziņš). [Edit, found this gloss-like thing which may be useful: UTF-8 character debug tool]


It's really quite difficult to search just for the Ä“ character by itself, Google hasn't the faintest idea what to do with it. For some of the longer names the word did come up in Google - quite amusingly - and in some cases the Latvian version was also available from which I could work out what the letters meant.

I was also surprised that 'place names in Latvia' brought up this detailed resource which has helped me home in on likely suspects. In most cases I've had to 'triangulate' (I love that word) information from more than one source. Thanks too to @helga_j who was able to work out a word I was particularly struggling with, which opened up other words for me (once you've got another letter it's easier to search the place names database).

I am really quite amazed though that there isn't already a resource where you can type in your mystery characters and it tells you the correct letter. So I've decided to start one, in the sense that you can read off the list below and find your mystery letter. There will be a few more to add as I continue to do this, I've not solved all of them yet. 


This is probably the closest I'll ever get to working on cryptography type things at Bletchley Park. I really wasn't expecting to enjoy this as much as I have done but I hope others find it useful if they ever search for mystery names.

I'm assuming that these characters are consistent across all languages and that all computers would render Ä“ as ē - could be wrong.
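
If the Mojibake explanation in the edit log is right, the mapping shouldn't depend on the language at all, and it can be undone mechanically rather than letter by letter. The following is a rough sketch under that assumption. The function names `to_bytes` and `unscramble` are made up for illustration, and the use of `cp1252` with a `latin-1` fallback (for the handful of byte values that Python's `cp1252` codec leaves undefined) is a guess at what happened to the data, not a confirmed fact.

```python
def to_bytes(text: str) -> bytes:
    """Turn each garbled character back into the single byte it came from."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Covers invisible control characters such as U+0081;
            # raises UnicodeEncodeError again for anything above U+00FF.
            out += ch.encode("latin-1")
    return bytes(out)


def unscramble(text: str) -> str:
    """Return the repaired text, or the input unchanged if it doesn't fit the pattern."""
    try:
        return to_bytes(text).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


for name in ["BÄ“rziÅ†Å¡", "LÄ«gatnes", "London"]:
    print(name, "=", unscramble(name))
```

Text that is already correct should pass through untouched, because it either can't be turned back into single bytes or doesn't form valid UTF-8. This won't rescue a name where a character was dropped or replaced with a question mark somewhere along the way.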


Probable 'translations' of characters into letters
Copy and paste into notepad / plain text, or re-size


Mystery letter = Likeliest real letter
Ä = ā
Ä¼ = ļ
Ä“ / Ä— = ē
Ä« = ī
Ä· = ķ
Å¡ = š
Å† = ņ
Å« = ū
Å¾ = ž
Å„ = ń
Ã = í
Ã¡ = á
Ã³ = ó
Ã¶ = ö
Ãª = ê
Ã© = é
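
The list above could in principle be regenerated, or extended to other letters, by running the garbling forwards. This is a sketch that assumes the same UTF-8-read-as-Windows-1252 mix-up; `as_seen` is a hypothetical helper name. A few byte values have no Windows-1252 character, so the sketch falls back to `latin-1` for those, which gives an invisible control character. That would explain why ā and í appear to have only a single mystery character.

```python
def as_seen(letter: str) -> str:
    """Show how a correctly typed letter would appear after the assumed mix-up."""
    chars = []
    for b in letter.encode("utf-8"):
        one = bytes([b])
        try:
            chars.append(one.decode("cp1252"))
        except UnicodeDecodeError:
            # Byte has no Windows-1252 character; latin-1 gives a control character.
            chars.append(one.decode("latin-1"))
    return "".join(chars)


for letter in "āļēīķšņūžńíáóöêé":
    print(as_seen(letter), "=", letter)
```

Any other letter can be added to the string in the loop to see what its mystery form would be.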




Ä
Ä might be ā

Ä¼ = ļ
ZÄ«Ä¼u = Zīļu

Ä“ / Ä— = ē
JÄ“kabpils = Jēkabpils
BÄ“rziÅ†Å¡ = Bērziņš

Ä« = ī
KuldÄ«ga = Kuldīga
LÄ«gatnes = Līgatnes

Ä· = ķ
BiÄ·ernieku = Biķernieki or Bikernieki

Å¡ = š
BÄ“rziÅ†Å¡ = Bērziņš

Å† = ņ
BÄ“rziÅ†Å¡ = Bērziņš

Å« = ū
mÅ«zikas (in context = music), so I searched for the Latvian word for music = mūzika
LÅ«kina = Lūkina

Å¾ = ž
RopaÅ¾u = Ropažu, in English = Ropaži

Å„ = ń
PowstaÅ„cÃ³w = Powstańców


Ã = í
NyÃregyhÃ¡za = Nyíregyháza

Ã¡ = á
NyÃregyhÃ¡za = Nyíregyháza

Ã³ = ó
JÃ³zsef = József

Ã¶ = ö
BjÃ¶rk = Björk

Ãª = ê
CiÃªncias = Ciências

Ã© = é
route de l'aÃ©roport = route de l'aéroport



http://imgs.xkcd.com/comics/encoding.png
Encoding

 

Morse code

Edit: 13 September 2012
I have discovered, after visiting the Orkney Wireless Museum, that people bothered to include accented letters in Morse code exchanges, or at least made it possible to use some of them. I just took a snap of this and didn't investigate any further to find out how commonly they were used. I'd have assumed that diacritical marks would have been the first casualty of war* but perhaps people set much greater store by them than I do (speaking as someone who has no accented letters in my name and assumes you can kind of survive without them). I am rather delighted that they are included though :)

*Obviously Morse code was used as a communication tool under many circumstances, not just war.
 
Morse code key (as in explanation)
Orkney Wireless Museum, taken in September 2012




â€“
Edit: 15 November 2014
While looking at some text about a road accident in 1863 I spotted a lot of â€“ characters in it, which are obviously meant to be some punctuation mark, but which?


"on the evening of 26th December I was at Mr. John's, Mile-end-road, and saw the prisoner—he was racing; he was in a cart and horse—there was another cart alongside of him—this was in the Mile-end-road; Globe-road crosses the Mile-end-road—I saw an old gentleman knocked down in Globe-road by the horse the prisoner was driving—there is a crossing there across the road—there were not many persons about—the two horses and carts were racing together; one was trying to get before the other—it was pretty light; it was about 5 o'clock, between 5 and 6."

Delightfully, googling for the characters actually brought up quite a lot of stuff, including this gloss which I've adapted from Mark McBride's blog post.

â€¦ = ...
â€“ = –
â€™ = ’
â€œ = “

Edit: 18 November 2014
I've just had another example of this, with "Connah's Quay", which rendered on-screen as "Connahâ€™s Quay".
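
The punctuation cases look different (three mystery characters instead of two) but seem to be the same mix-up: curly quotes, dashes and ellipses each take three bytes in UTF-8. A small sketch, again assuming the bytes were read as Windows-1252 (Python's `cp1252`); `garble` is a made-up name, and the marks are written as escape codes so the code itself can't get garbled.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# Ellipsis, en dash, right single quote (curly apostrophe), left double quote.
for mark in ["\u2026", "\u2013", "\u2019", "\u201c"]:
    print(garble(mark), "=", mark)

# A curly apostrophe is enough to garble an otherwise plain English place name.
print(garble("Connah\u2019s Quay"))
```

A plain typewriter apostrophe (') is a single byte and survives untouched, so only text that has been through something that 'smartens' its punctuation tends to be affected.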



