I have been hard at work behind the scenes at Wokai laying the groundwork for what will hopefully be a flexible and comprehensive multilingual website. So this week I thought I'd share some of the details that may be helpful to other philanthropic hackers out there (money-grubbing techies please hide your eyes). I'm writing this blog post while sitting on the tarmac at Beijing's airport … waiting for the "Beijing fog" to lift. This exercise in fantasy may take some time. Eventually I expect to reach the island that is South Korea Incheon Airport - and will roam from one wifi hotspot to the next… so please attribute any disorganization in this article to on-the-job environmental hazards.
General UTF-8 Configuration
First up is the preliminary yet seemingly endless task of configuring an appropriate character encoding for the platform. UTF-8 is the most common choice for web applications and can encode every Unicode character, so it is the sensible default unless some part of your stack specifically requires UTF-16. As most programmers know, what makes configuring a character set difficult is simply the multitude of places where such configuration exists. Any place in your system where a conversation occurs between two distinct pieces of software deserves inspection and may come with its own charset configuration properties or quirks (i.e., bugs that haven't yet been realized). Furthermore, there are wide areas of overlapping configuration where one layer can override another, which makes things confusing. Below are a few of the basics that are nonetheless sometimes overlooked.
Database Variables -- Define UTF-8 as defaults for the MySQL server, connecting client, and command-line client. Also note the "init_connect" property to automatically apply a default character set on all connections.
[client]
default-character-set=utf8

[mysqld]
init_connect='SET collation_connection = utf8_general_ci'
init_connect='SET NAMES utf8'
default-character-set=utf8
character-set-server=utf8
collation-server=utf8_general_ci

[mysql]
default-character-set=utf8
Database Tables -- Even with the above configuration, you should still check to be sure all existing tables are defined as UTF8.
CREATE TABLE `partnerPostings` (
…
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Database Connection -- Connecting clients such as JDBC can also configure properties directly.
dbUri=jdbc:mysql://127.0.0.1:3306/wokai?
useUnicode=true
&characterSetResults=UTF-8
&characterEncoding=UTF-8
JSP Pages -- Declare the response encoding at the top of each JSP page.
<%@ page contentType="text/html;charset=UTF-8" language="java" %>
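To see why every hop in the list above has to agree on the charset, here is a minimal sketch of what happens when one layer writes UTF-8 and the next one reads the same bytes as ISO-8859-1. The class and method names are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: simulates two layers that disagree on the charset.
final class CharsetMismatchDemo {

    // Encode as UTF-8, then decode the same bytes with the wrong charset.
    static String decodeWrong(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encode and decode with the same charset on both sides.
    static String decodeRight(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d"; // two Chinese characters, three UTF-8 bytes each
        System.out.println(original.equals(decodeRight(original))); // true
        System.out.println(decodeWrong(original).length());         // 6, one char per byte
    }
}
```

The mismatched result is the familiar row of accented Latin characters, and nothing downstream can tell that it was ever Chinese text.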
CharsetEncoding Servlet Filter
When browsers connect to the web server, their requests are also encoded in a charset. Often, the browser does not declare which charset it used for the content passed in HTML forms or URLs. Crudely forcing the character encoding of all inbound HTTP requests has become a commonplace protective measure, and it saves you from having to bite off more complex decision making. In a Java environment, this is easily done by creating a Servlet Filter such as the following:
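package org.wokai;

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharsetEncodingFilter implements Filter
{
    private FilterConfig fc;

    public void doFilter(ServletRequest req, ServletResponse res,
            FilterChain chain) throws IOException, ServletException
    {
        // Must run before any request parameter is read
        req.setCharacterEncoding("UTF-8");
        // Labels the response as HTML, so map this filter only to URLs that serve HTML
        res.setContentType("text/html; charset=UTF-8");
        chain.doFilter(req, res);
    }

    public void init(FilterConfig filterConfig)
    {
        this.fc = filterConfig;
    }

    public void destroy()
    {
        this.fc = null;
    }
}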
And configuring the filter in your web.xml:
<filter>
    <filter-name>CharsetEncodingFilter</filter-name>
    <filter-class>org.wokai.CharsetEncodingFilter</filter-class>
</filter>
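To illustrate what the setCharacterEncoding call in the filter protects against, here is a small sketch that decodes the same percent-encoded form value twice, once as UTF-8 and once as ISO-8859-1. It uses java.net.URLDecoder as a stand-in for the container's parameter parsing, which is an assumption made for illustration; the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class: URLDecoder stands in for the servlet container here.
final class RequestDecodingSketch {

    // Decode a percent-encoded value using the named charset.
    static String decode(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of two Chinese characters, as a browser would submit them
        String raw = "%E4%BD%A0%E5%A5%BD";
        System.out.println(decode(raw, "UTF-8").length());      // 2
        System.out.println(decode(raw, "ISO-8859-1").length()); // 6
    }
}
```

The bytes on the wire are identical in both cases; only the charset the server assumes decides whether the application sees two characters or six.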
Stripes Custom Locale Picker
Whatever web MVC framework you're using, you'll likely want to configure a custom locale picker to provide some extra assurance that whatever locales are encountered in an HTTP request are narrowed down to the list of locales supported by your web application. Additionally, you'll want to honor a web user's request to override the default locale. I do this by setting a value in the session. With Stripes, creation of a custom locale picker is quite simple, and the result is applied to the HttpServletRequest so any calls to getLocale() will abide. Very nice.
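package org.wokai;

import java.util.Locale;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpSession;
import net.sourceforge.stripes.localization.DefaultLocalePicker;

public class CustomLocalePicker extends DefaultLocalePicker
{
    public final static Locale CHINESE = new Locale("zh", "CN");
    public final static Locale ENGLISH = new Locale("en", "US");

    @Override
    public Locale pickLocale(HttpServletRequest request)
    {
        Locale locale = super.pickLocale(request);
        // Do not create a session just to look up a preference
        HttpSession session = request.getSession(false);
        String sessionLocale = null;
        if (session != null)
        {
            // An explicit user choice stored in the session wins
            sessionLocale = (String) session.getAttribute("locale");
            if (sessionLocale != null)
            {
                if (sessionLocale.equalsIgnoreCase("chinese"))
                {
                    return CHINESE;
                }
                return ENGLISH;
            }
        }
        // Otherwise map anything Chinese to zh_CN and everything else to en_US
        if (locale.getCountry().equals("CN") ||
            locale.getLanguage().equals("zh"))
        {
            return CHINESE;
        }
        return ENGLISH;
    }
}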
And configuring this custom locale picker into Stripes…
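<!-- inside the StripesFilter <filter> element in web.xml -->
<init-param>
    <param-name>LocalePicker.Locales</param-name>
    <param-value>en_US:UTF-8,zh_CN:UTF-8</param-value>
</init-param>
<init-param>
    <param-name>LocalePicker.Class</param-name>
    <param-value>org.wokai.CustomLocalePicker</param-value>
</init-param>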
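The decision order in the picker (session override first, then the browser's locale, then English) can be exercised without a servlet container. The sketch below pulls that logic into a plain static method; the class name and the choose method are hypothetical and not part of Stripes.

```java
import java.util.Locale;

// Hypothetical container-free version of the picker's decision logic.
final class LocaleChoiceSketch {
    static final Locale CHINESE = new Locale("zh", "CN");
    static final Locale ENGLISH = new Locale("en", "US");

    // sessionLocale: the value stored in the session, or null if none was set
    // requested: the locale negotiated from the HTTP request
    static Locale choose(String sessionLocale, Locale requested) {
        if (sessionLocale != null) {
            // An explicit user choice always wins
            return sessionLocale.equalsIgnoreCase("chinese") ? CHINESE : ENGLISH;
        }
        if (requested != null
                && ("CN".equals(requested.getCountry())
                    || "zh".equals(requested.getLanguage()))) {
            return CHINESE;
        }
        return ENGLISH;
    }

    public static void main(String[] args) {
        System.out.println(choose(null, Locale.TAIWAN));     // zh_CN
        System.out.println(choose("english", Locale.CHINA)); // en_US
        System.out.println(choose(null, Locale.FRANCE));     // en_US
    }
}
```

Note that a Taiwanese browser locale lands on zh_CN because only the language is checked, which is what you want when zh_CN is the only Chinese bundle available.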
Message Resource Bundles
Now that we know what language we want to present, it is time for the translation meat. Tokenizing your entire website (disassembling all the text into succinct phrases that can ideally be reused) is a most tedious process. As most don't take their website bilingual until after it is well endowed with content, this is by far the most time-consuming step. Using Java properties files to record the mappings is very straightforward, but will require some strategy on naming conventions. In some cases reusability is key; in others, quarantining the text to a specific page is necessary. Over-thinking this may also be a hazard. Simplicity is paramount, as you'll be handing these properties files off to non-programming types or third-party services which may very well ignore the carefully considered naming conventions you used. (One such service that provides collaborative translation of message resource bundles is Crowdin.net.)
If you haven't used property files for multilingual text before, you might be surprised to find that the charset is hard-coded to ISO-885920372319. (Okay, I'm exaggerating a little; it is ISO-8859-1.) To be clear, Java's support for multilingual text rests on top of a non-Unicode charset. Stupid, but only because Sun has preferred to leave the embarrassment on display for a decade instead of adding in the easy fix. Nonetheless, here's the hack for getting around this problem: write the properties file in UTF-8 and then run it through "native2ascii", which rewrites every non-ASCII character as a \uXXXX escape that is safe in an ISO-8859-1 file. One can also write a custom resource bundle implementation to migrate away from property files altogether, but I've not found a significant motivator for this yet.
#!/bin/sh
native2ascii -encoding utf8 resOrig.properties > res.properties
native2ascii -encoding utf8 resOrig_zh_CN.properties > res_zh_CN.properties
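For the curious, the conversion itself is small. The sketch below is a rough approximation of what the tool does to each value, not its actual implementation, and it handles plain key=value lines only. The class and method names are hypothetical. It also shows that java.util.Properties turns the escapes back into the original characters when loading.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch approximating the escaping step for a single value.
final class UnicodeEscapeSketch {

    // Replace every char outside 7-bit ASCII with a backslash-u hex escape.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 128) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    // Write key=value in escaped form, load it with Properties, return the value.
    static String roundTrip(String key, String value) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(key + "=" + escape(value)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        String chinese = "\u4f60\u597d";
        System.out.println(escape(chinese));                              // ASCII-only escaped form
        System.out.println(chinese.equals(roundTrip("hello", chinese))); // true
    }
}
```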
Applying Localized Text
We can now present localized text on web pages with a common JSP format tag like the following:
<%@ taglib prefix="fmt" uri="http://java.sun.com/jsp/jstl/fmt" %>
...
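Under the hood, the message tag boils down to a keyed lookup in the bundle for the current locale, with a fallback to the default bundle. Here is a minimal sketch of that idea in plain Java. It is not the JSTL implementation; the class name, the message keys, and the inline properties text are all made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

// Hypothetical sketch of a locale-aware message lookup with a default fallback.
final class MessageLookupSketch {
    private final Map<Locale, ResourceBundle> bundles = new HashMap<>();
    private final ResourceBundle fallback;

    MessageLookupSketch(String defaultProperties) {
        this.fallback = parse(defaultProperties);
    }

    void add(Locale locale, String properties) {
        bundles.put(locale, parse(properties));
    }

    // Look in the locale's bundle first, then in the default bundle.
    String message(Locale locale, String key) {
        ResourceBundle bundle = bundles.get(locale);
        if (bundle != null && bundle.containsKey(key)) {
            return bundle.getString(key);
        }
        if (fallback.containsKey(key)) {
            return fallback.getString(key);
        }
        return "???" + key + "???"; // visible marker for untranslated keys
    }

    private static ResourceBundle parse(String properties) {
        try {
            return new PropertyResourceBundle(new StringReader(properties));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        MessageLookupSketch messages =
            new MessageLookupSketch("home.welcome=Welcome\nhome.footer=About us");
        messages.add(Locale.CHINA, "home.welcome=\u6b22\u8fce");
        System.out.println(messages.message(Locale.CHINA, "home.footer")); // About us
        System.out.println(messages.message(Locale.CHINA, "home.nope"));   // ???home.nope???
    }
}
```

The fallback is what lets you ship a partially translated bundle: any key the translators have not reached yet still renders in the default language.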
Custom Locale Include Tag
While the above "fmt:message" tag handles most of the localization work, there are other cases where it may be desirable to create more than one version of a complete web page - or include localized fragments. For this purpose, I created a custom JSP tag that will include JSP page fragments with proper request forwarding.
import java.util.Locale;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletContext;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.jsp.JspException;
import javax.servlet.jsp.JspTagException;
import javax.servlet.jsp.JspWriter;
import javax.servlet.jsp.tagext.BodyTagSupport;
// Assumed origins of the two non-standard classes: Tomcat's Jasper and log4j
import org.apache.jasper.runtime.ServletResponseWrapperInclude;
import org.apache.log4j.Logger;

public class LocaleIncludeTag extends BodyTagSupport
{
    private static final long serialVersionUID = 1L;
    protected static Logger logger_ = Logger.getLogger(LocaleIncludeTag.class);
    private String defaultPath_;

    public LocaleIncludeTag()
    {
        super();
    }

    public void setDefaultPath(String defaultPath)
    {
        this.defaultPath_ = defaultPath;
    }

    @Override
    public int doStartTag() throws JspException
    {
        try
        {
            JspWriter out = this.pageContext.getOut();
            ServletContext context = this.pageContext.getServletContext();
            ServletRequest request = this.pageContext.getRequest();
            ServletResponse response = this.pageContext.getResponse();
            // Returns the locale chosen by the custom locale picker
            Locale locale = request.getLocale();
            String base = this.defaultPath_;
            String type = null;
            String path = this.defaultPath_;
            // Split "name.ext" so the locale can be inserted before the extension
            int idx = this.defaultPath_.lastIndexOf(".");
            if (idx > 0)
            {
                base = this.defaultPath_.substring(0,idx);
                type = this.defaultPath_.substring(idx);
            }
            if (type != null)
            {
                path = base + "_" + locale.toString() + type;
            }
            else
            {
                path = base + "_" + locale.toString();
            }
            // REVERT TO DEFAULT IF LOCALE_FILE NOT FOUND
            if (context.getResource(path) == null)
            {
                path = this.defaultPath_;
            }
            RequestDispatcher rd = request.getRequestDispatcher(path);
            rd.include(request, new ServletResponseWrapperInclude(response, out));
        }
        catch (Exception e)
        {
            // Keep the original exception as the cause for easier debugging
            throw new JspTagException("exception: " + e.getMessage(), e);
        }
        return SKIP_BODY;
    }

    @Override
    public int doEndTag() throws JspException
    {
        return EVAL_PAGE;
    }
}
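The path handling is the part of this tag most worth testing in isolation. Below is a sketch of the same rule (insert the locale before the file extension, fall back to the default path when the localized file is missing) as a pair of static methods. The class name is hypothetical, and a plain Set of known paths stands in for the ServletContext.getResource check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical container-free version of the tag's path resolution.
final class LocalizedPathSketch {

    // "/a/page.jsp" + zh_CN becomes "/a/page_zh_CN.jsp"
    static String candidate(String defaultPath, Locale locale) {
        int idx = defaultPath.lastIndexOf('.');
        if (idx > 0) {
            return defaultPath.substring(0, idx) + "_" + locale + defaultPath.substring(idx);
        }
        return defaultPath + "_" + locale;
    }

    // Use the localized path only if it is known to exist.
    static String resolve(String defaultPath, Locale locale, Set<String> existing) {
        String localized = candidate(defaultPath, locale);
        return existing.contains(localized) ? localized : defaultPath;
    }

    public static void main(String[] args) {
        // Example paths are made up for this demo
        Set<String> files = new HashSet<>(
            Arrays.asList("/fragments/about.jsp", "/fragments/about_zh_CN.jsp"));
        System.out.println(resolve("/fragments/about.jsp", Locale.CHINA, files)); // /fragments/about_zh_CN.jsp
        System.out.println(resolve("/fragments/about.jsp", Locale.US, files));    // /fragments/about.jsp
    }
}
```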
Additional Steps
So far, I've outlined a fairly standard approach to bilingual website development, with a few customizations for added flexibility and reliability. An implementation for retrieving localized text from the database is also needed. For Wokai, I chose to build a simple JSP helper tag to select object properties based on locale. Wokai isn't at this time planning on supporting unlimited languages, but if that were the goal you would need to externalize localized resources into a normalized table structure and build a service to feed from that.
Next on my to-do list is to add a layer of URL redirection for SEO. Web crawlers run from all parts of the world and, depending on their configuration, may be presented with a different version of the website on each visit… or at a minimum will be unaware of the alternate versions available. To fix this problem, I plan to prefix all URLs with the desired locale. The prefix will be processed and stripped from the URL, and the request forwarded. Of course, this also requires every link on the website to be rewritten to include the required prefix. A good deal of work, but with the right tools it is quite manageable. Perhaps we'll leave this for a blog post on another day.
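As a teaser, here is a rough sketch of the two string operations such a scheme needs: splitting a locale prefix off an incoming URL, and adding one to an outgoing link. This is only a guess at the shape of the eventual code. The class name, the /zh_CN/... URL layout, and the example paths are all assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of locale-prefixed URL handling.
final class LocalePrefixSketch {
    static final List<String> SUPPORTED = Arrays.asList("en_US", "zh_CN");

    // Returns {locale, remainingPath}; locale is null when no supported prefix is present.
    static String[] split(String uri) {
        for (String tag : SUPPORTED) {
            String prefix = "/" + tag;
            if (uri.equals(prefix)) {
                return new String[] { tag, "/" };
            }
            if (uri.startsWith(prefix + "/")) {
                return new String[] { tag, uri.substring(prefix.length()) };
            }
        }
        return new String[] { null, uri };
    }

    // Build the outgoing link for a given locale.
    static String withPrefix(String locale, String path) {
        return "/" + locale + path;
    }

    public static void main(String[] args) {
        String[] parts = split("/zh_CN/loans/list");
        System.out.println(parts[0]);                           // zh_CN
        System.out.println(parts[1]);                           // /loans/list
        System.out.println(withPrefix("en_US", "/loans/list")); // /en_US/loans/list
    }
}
```

In a servlet filter, the first element would feed the locale picker and the second would be the path handed to the request dispatcher.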