I recently received a very interesting hospital dataset in excel format, with the variable names and some basic values in Japanese. The Japanese values included the sex of the patient and the specialty under which they were admitted to hospital. Initially this would be reasonably easy to convert to English in excel before import, but it would require making a pivot table and fiddling a bit (my excel-fu is a bit rusty). I also have address data, and though at this stage it’s not important it may be in the future. So, at some point, I’m going to have to import this data in its Japanese form, and I figured I should work out how to do it.
The problem is that a straight import of the data leads to garbled, completely illegible characters, and very little information appears to be available online about how to import Japanese-labeled data into Stata. A 2010 entry on Statalist suggests it is impossible:
Unfortunately Stata does not support Unicode and does not support other multi-byte character sets, such as those necessary for Far Eastern Languages. If you are working with a data set in which all of the strings are in a language that can be represented by single byte characters (all European languages) just choose the appropriate output encoding. However, if your dataset contains strings in Far Eastern langages or multiple languages that use different character sets, you will simply not be able to properly represent all of the strings and will need to live with underscores in your data.
This is more than a little unfortunate but it’s also not entirely correct: I know that my students with Japanese operating systems can import Japanese data into Stata quite easily. So I figured there must be something basic going wrong with my computer that was stopping it from doing a simple import. In the spirit of sharing solutions to problems that I find with computers and stats software, here are some solutions to the problem of importing far Eastern languages for two different operating systems (Windows and Mac OS X), with a few warnings and potential bugs or problems I haven’t yet found a solution for.
Case 1: Japanese language, Windows OS
In this case there should be no challenge importing the data. I tried it on my student’s computer: you just import the data any old how, whether it’s in .csv or excel format. Then in your preferences, set the font for the data viewer and the results window to be any of the Japanese-language OS defaults: MS Mincho or Osaka, for example.
This doesn’t work if you’re in an English language Windows, as far as I know, and it doesn’t work in Mac OS X (this I definitely know). In the latter case you are simply not able to choose the Japanese native fonts, because Stata doesn’t use them. No matter what font you choose, the data will show up as gobbledygook. There is a solution for Mac OS X, however (see below).
Case 2: English language, Windows OS
This case is fiddly, but it has been solved, and the solution can be found online through the helpful auspices of the igo, programming and economics blogger Shinobi. His or her solution only popped up when I did a search in Japanese, so I’m guessing that it isn’t readily available to the English language Stata community. I’m also guessing that Shinobi solved the problem on an English-language OS, since it’s not relevant on a Japanese-language OS. Shinobi’s blog post has an English translation at the bottom (very helpful) and extends the solution to Chinese characters. The details are on Shinobi’s blog, but basically what you do is check your .csv file to see how it is encoded, then use a very nifty piece of software called iconv to translate the .csv file from its current encoding to one that can be read by Stata. In the example Shinobi gives (for Chinese) the target is GB18030 encoding, but I think for Japanese Stata can read Shift-JIS (I found this explained somewhere online a few days ago but have lost the link).
Encoding is one of those weird things that most people who use computers (me included!) have never had to pay attention to, but it’s important in this case. Basically there are different ways to assign underlying values to far Eastern languages (this is the encoding) and although excel and most text editors recognize many, Stata only recognizes one. So if you have a .csv file that is a basic export from, say, excel, it’s likely in an encoding that Stata doesn’t recognize on an English-language OS. So just change the encoding of the file, and then Stata should recognize it.
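To make the idea of encoding concrete, here is a minimal Python sketch (Python rather than Stata, simply because it makes the underlying bytes easy to see). The word 男性 ("male") is a made-up stand-in for a value of the sex variable, not something taken from my dataset. It shows the same two characters stored as different bytes under UTF-8, Shift-JIS and EUC-JP, and what happens when bytes written in one encoding are read as if they were another.

```python
# Hypothetical example value: "male", as it might appear in a sex variable.
text = "男性"

# The same two characters become different byte sequences in each encoding.
utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")
eucjp_bytes = text.encode("euc_jp")

print(utf8_bytes.hex(" "), "| length", len(utf8_bytes))
print(sjis_bytes.hex(" "), "| length", len(sjis_bytes))
print(eucjp_bytes.hex(" "), "| length", len(eucjp_bytes))

# Reading UTF-8 bytes as if they were Shift-JIS does not recover the text.
# errors="replace" substitutes a marker for any byte that cannot be mapped,
# so this line never raises; it just produces garbage.
garbled = utf8_bytes.decode("shift_jis", errors="replace")
print(garbled)

# Reading the bytes with the encoding they were written in works fine.
roundtrip = sjis_bytes.decode("shift_jis")
print(roundtrip)
```

The garbled line is the same kind of thing you see in Stata’s data viewer when the file’s encoding and the one Stata expects don’t match.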
Working out what encoding your .csv file is currently in can be fiddly, but basically if you open it in a text editor you should be able to access the preferences of the editor and find out what the encoding is; then you can use iconv to convert to a new one (see the commands for iconv in Shinobi’s blog).
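If you would rather script this step than poke around in a text editor, the following is a rough Python sketch of the check-then-convert idea. It is a stand-in for the iconv step, not Shinobi’s commands: the function names guess_encoding and convert_file, the list of candidate encodings, and the file names in the usage comment are all my own assumptions. Guessing is a heuristic only, since some byte sequences are valid in more than one encoding, so treat the editor check as the more reliable one. The Shift-JIS target follows my hunch about what Stata can read and is likewise unconfirmed.

```python
from pathlib import Path

# Candidate encodings, tried in order. This list is an assumption; add or
# reorder entries to suit your files. "utf-8-sig" also accepts plain UTF-8
# and strips the byte-order mark that some spreadsheet exports add.
CANDIDATES = ["utf-8-sig", "euc_jp", "shift_jis"]


def guess_encoding(raw: bytes, candidates=CANDIDATES) -> str:
    """Return the first candidate that decodes the bytes without error.

    This is a heuristic: a successful decode does not prove the guess right.
    """
    for name in candidates:
        try:
            raw.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit this file")


def convert_file(src: str, dst: str, target: str = "shift_jis") -> str:
    """Re-encode src into dst using the target encoding.

    Returns the encoding that was guessed for the source file.
    """
    raw = Path(src).read_bytes()
    source = guess_encoding(raw)
    text = raw.decode(source)
    # encode() raises UnicodeEncodeError if a character has no
    # representation in the target encoding, rather than silently dropping it.
    Path(dst).write_bytes(text.encode(target))
    return source


# Usage (hypothetical file names):
# print(convert_file("hospital.csv", "hospital_sjis.csv"))
```

Writing to a new file rather than overwriting the original means a wrong guess costs nothing: you can open the output in a text editor, check that the Japanese looks right, and only then try the import.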
Unfortunately this doesn’t work on Mac OS X: I know this, because I tried extensively. Mac OS X has iconv built in, so you can just open a terminal and run it. BUT, no matter how you change the encoding, Stata won’t read the resulting text file. You can easily interpret Shinobi’s solution for use on Mac but it won’t work. This may be because the native encoding of .csv files on Mac is unclear to the iconv software (there is a default “Mac” encoding that is hyper dodgy). However, given the simplicity of the solution I found for Mac (below), it seems more likely that the problem is something deep inside the way Stata and the OS interact.
Case 3: English-language, Mac OS X
This is, of course, something of a false case: there is no such thing as a single-language Mac OS X. Realizing this, and seeing that the task was trivial on a Japanese-language Windows but really fiddly on an English-language Windows, it occurred to me to just change the language of my OS (one of the reasons I use Apple is that I can do this). So, I used the language preferences to change the OS language to Japanese, and then imported the .csv file. Result? Stata could instantly read the Japanese. Then I just switched my OS back to English when I was done with Stata. This is a tiny bit fiddly in the sense that whenever you want to work on this file you have to switch OS languages, but doing so on Apple is really trivial: maybe 3 or 4 clicks.
When you do this though, if you aren’t actually able to read Japanese, you’ll be stuffed trying to get back. So, before you do this, make sure you change your system settings so that the language options are visible on the menu bar (you will see a little flag corresponding to your default locale appear next to the date and time). Then, make sure you know the sequence of clicks to get back to the regional language settings (it’s the bottom option of the language options menu in your menu bar, then the left-most tab inside that setting). That way you can change back easily. Note also that you don’t, strictly speaking, have to change the actual characters on the screen into Japanese! This is because when you choose to change your default OS language, a little window pops up saying that the change will apply to the OS next time you log in but will apply to individual programs next time you open them. So you can probably change the OS language, open Stata, fiddle about, close Stata, then change the OS language back to English, and so long as you don’t log out or restart, you should never see a single Japanese-language menu! A weird, and kind of trivial, solution!
A final weird excel problem
Having used this trick in Mac OS X, I thought to try importing the data from its original excel format, rather than from the intermediate .csv file. To my surprise, this didn’t work! In programming terms, running insheet to import the .csv file translates the Japanese perfectly, but running import excel on the excel file fails to translate properly! So, either there is something inaccessible about excel’s encoding, or the excel import routine is broken in Stata. I don’t know which, but this does mean that if you receive a Japanese-language excel file and you’re using Mac OS X, you will need to export to .csv before you import to Stata. This is no big deal: before Stata 12, there was no direct excel import method for Stata.
A few final gripes
As a final aside, I take this as a sign that Stata need to really improve their support for Asian languages, and they also need to improve the way they handle excel. Given excel’s importance in the modern workplace, I think it would be a very good idea if Microsoft did more to make it fully open to other developers. It’s the default data transfer mechanism for people who are unfamiliar with databases and statistical software and it is absolutely essential that statisticians be able to work with it, whatever their opinions of its particular foibles or of the ethics of Microsoft. It also has better advanced programming and data manipulation properties than, say, OpenOffice, and this makes it all the more important that it match closely to standards that can be used across platforms. Excel has become a ubiquitous workplace tool, the numerical equivalent of a staple, and just as any company’s staplers can work with any other company’s staples if the standards match, so excel needs to be recognized as a public good, and made more open to developers at other companies. If that were the case I don’t think Stata would be struggling with Asian-language excel files but dealing fine with Asian-language .csv files.
And finally, I think this may also mean that both Apple and Microsoft need to drop their proprietary encoding systems and use an agreed, open standard. Microsoft also needs to grow up and offer support for multiple languages on all versions of Windows, not just the most expensive one.
Lastly, I hope this post helps someone out there with a Japanese-language import (or offers a way to import any other language that has a more extensive encoding than English).