A Multi-lingual Gateway on the Web

[JUN 96]

Ricky Chan

Introduction

Exploring the Internet in other parts of the world via the World Wide Web (WWW) is fun. Sometimes, however, you retrieve a document and see strange characters all over the screen. This happens because the character-encoding supported by the WWW client running on your desktop workstation is incompatible with the encoding scheme the WWW server uses to store its documents. Different regions use different encoding schemes. For instance, more than one internal code supports the Chinese language. The two most popular codes are GB, for regions using simplified Chinese characters such as mainland China, and BIG5, for regions using traditional Chinese characters such as Hong Kong and Taiwan. Other codes include Shift-JIS, JIS and EUC for Japanese, and KSC for Korean. For users to browse documents in multiple encodings, either the information server must store multiple copies of each document in various character-encodings, or the WWW client must support multiple character-encodings, or both. However, most existing operating systems support only one encoding scheme, which makes it difficult for WWW clients to support multiple character-encodings.
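
The mismatch described above can be reproduced in a few lines. The following is a minimal sketch in Python, not part of the Gateway itself; the sample string and the codec names "gb2312" and "big5" are assumptions chosen for illustration.

```python
# Encode a short Chinese string the way a GB-based server would store it.
text = "中文"
gb_bytes = text.encode("gb2312")

# A Big5-only client interprets the same bytes with the wrong code table.
# errors="replace" keeps the sketch from raising on byte pairs that are
# not assigned in Big5.
garbled = gb_bytes.decode("big5", errors="replace")

# Decoding with the encoding the server actually used recovers the text.
recovered = gb_bytes.decode("gb2312")

print(repr(garbled))      # unreadable characters, as seen by the client
print(recovered == text)  # the bytes themselves were never damaged
```

The bytes sent by the server are intact in both cases; only the table used to interpret them differs.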

To alleviate this problem, CSC has devised a new approach that overcomes the limitations of existing multi-lingual World Wide Web technology. Our idea is to construct a WWW gateway that supports the character-encodings of most of the WWW clients and information servers in the world. Consider a user who wishes to access a document whose character-encoding is not supported by his/her browser. With our proposed WWW Gateway, the document can be converted from the server's encoding to the client-supported encoding on the fly. Thus, users can view documents in character-encodings that their browsers do not support.

How does it work?

Our proposed multi-lingual WWW Gateway runs on a Sun Sparc 5 platform with the Solaris 2.4 operating system. Figure 1 illustrates the working principle of the Gateway. The Gateway consists of a PERL (Practical Extraction and Report Language) Common Gateway Interface (CGI) program and several Unicode-based code converters written in the C programming language.

Figure 1: Our proposed Unicode based Multi-lingual WWW Gateway

Suppose a user on campus wants to access a Web page in China that is stored in the GB character-encoding scheme, and assume that his/her WWW browser supports only the BIG5 character-encoding. Without any tool, he/she will see strange characters on the screen, as shown in Figure 2 below:

Figure 2: Output for viewing a Chinese GB Web page using a BIG5-based browser

The sentence is meaningless because the character-encoding has been misinterpreted. Some codes are valid in both character-encoding schemes but represent different characters. Our Gateway in Figure 1 can help solve this problem. Here is how it works. The user (WWW client, Step 1) fills in the HTML form at http://cctpwww.cityu.edu.hk/public/chinese/computing/multilingual/multilingual_gateway.html on the WWW server, as shown in Figure 3 below (Step 2).

Figure 3: A partial layout of our HTML form

Once the user clicks on the "Load URL" button, the form issues the request "http://cctpwww.cityu.edu.hk/chinese-cgi/chinesecgi?url=X1&ccode=X2&scode=X3&doctype=X4". The meaning of each parameter is summarized in Table I below:

| Parameter  | Name                             | Description                                                                                  | Default                          |
|------------|----------------------------------|----------------------------------------------------------------------------------------------|----------------------------------|
| url=X1     | Location of the file on the Internet | "url" is the short form of Uniform Resource Locator. Note: currently, only http is supported. | url=http://cctpwww.cityu.edu.hk/ |
| ccode=X2   | Character-encoding of the client | Supported character-encodings: Big5, GB, JIS and Unicode.                                    | ccode=Big5                       |
| scode=X3   | Character-encoding of the server | Supported character-encodings: Big5, GB, JIS and Unicode.                                    | scode=Big5                       |
| doctype=X4 | MIME content type of the file    | Supported content types: text/html, text/plain.                                              | doctype=text/html                |

Table I: The parameters accepted by our proposed Unicode-based WWW Gateway
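
To make the request format concrete, the sketch below builds and parses a Gateway URL using the parameters and defaults from Table I. It is an illustration in Python, not the Gateway's own PERL code; the helper names `build_request` and `parse_request` and the target page `http://www.example.cn/index.html` are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# CGI address quoted in the text above.
GATEWAY = "http://cctpwww.cityu.edu.hk/chinese-cgi/chinesecgi"

# Default values taken from Table I.
DEFAULTS = {
    "url": "http://cctpwww.cityu.edu.hk/",
    "ccode": "Big5",
    "scode": "Big5",
    "doctype": "text/html",
}

def build_request(**params):
    """Return a Gateway URL; omitted parameters fall back to the defaults."""
    merged = {**DEFAULTS, **params}
    # urlencode percent-escapes reserved characters in the values.
    return GATEWAY + "?" + urlencode(merged)

def parse_request(request_url):
    """Recover the four parameters from a Gateway URL, applying defaults."""
    query = parse_qs(urlsplit(request_url).query)
    return {name: query.get(name, [default])[0]
            for name, default in DEFAULTS.items()}

# Hypothetical example: a GB page viewed with a Big5 browser.
request = build_request(url="http://www.example.cn/index.html", scode="GB")
print(request)
print(parse_request(request))
```

Note that the target address is itself a URL, so its reserved characters are percent-escaped when it is embedded in the query string; a browser submitting the form does this automatically.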

The user-supplied information is interpreted by our Gateway (Step 3). The WWW server then retrieves the document, stored in the given server character-encoding, from the specified location on the information server (Step 4) and passes it back to the Gateway (Step 5). The content is modified if necessary. To match the character-encoding of the WWW client, the document is converted to the client character-encoding using our Unicode-based code converters (Step 6). The converted document is then passed to the WWW client for viewing, as shown in Figure 4 below (Step 7). The conversion process is completely transparent to the user: he/she merely fills in a form, and our Gateway performs the steps needed to convert the Web page into readable characters on the spot.
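
The conversion in Step 6 can be pictured as two stages with Unicode in the middle: server encoding to Unicode, then Unicode to client encoding. The following Python sketch illustrates that idea only; the Gateway's actual converters are written in C, and the `convert` function and the mapping from Table I names to Python codec names (in particular "JIS" to ISO-2022-JP and "Unicode" to UTF-8) are assumptions of this example.

```python
# Map the encoding names of Table I to Python codec names.
# This mapping is an assumption of the sketch, not taken from the Gateway.
CODECS = {
    "Big5": "big5",
    "GB": "gb2312",
    "JIS": "iso2022_jp",
    "Unicode": "utf-8",
}

def convert(document: bytes, scode: str, ccode: str) -> bytes:
    """Convert document bytes from the server to the client encoding."""
    if scode == ccode:
        return document  # same encoding on both sides, nothing to do
    # Stage 1: server encoding -> Unicode (a Python str is a Unicode string).
    text = document.decode(CODECS[scode], errors="replace")
    # Stage 2: Unicode -> client encoding; characters the client encoding
    # cannot represent are replaced by "?".
    return text.encode(CODECS[ccode], errors="replace")

# A GB-encoded page converted for a Big5-only browser.
gb_page = "中文".encode("gb2312")
big5_page = convert(gb_page, scode="GB", ccode="Big5")
print(big5_page.decode("big5"))
```

Using Unicode as the intermediate form means each supported encoding needs only one converter to Unicode and one from it, rather than a separate converter for every pair of encodings. This sketch does not map simplified character forms to traditional ones; a character with no code in the client encoding is simply replaced.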

Figure 4: The result of browsing a GB document using a BIG5 WWW browser through our Gateway

Conclusion

With our Gateway, users are no longer bound by the character-encoding supported by their browsers and can gain access to a greater pool of information resources. Moreover, a World Wide Web server can provide information in multiple character-encodings while storing only one copy of each document, thus saving a lot of storage space. When a new character-encoding is supported, there is no need to create a new copy of the documents in that encoding. Security checks can also be performed in our WWW Gateway if required.

Our development is far from complete. More research will be done to address the following issues:

[Issue No. 7]


Computing Services Centre
City University of Hong Kong
ccnetcom@cityu.edu.hk
