MatPlus.Net

Website founded by
Milan Velimirović
in 2006

Forum

Feedback by Members

UNICODE - Does not work for me?

(1) Posted by Miodrag Mladenović [Friday, Sep 8, 2006 22:07]

Hi Milan,

I tried to use some special characters defined by Unicode, but for some reason I am running into problems. I can type a message and everything looks OK, but as soon as I hit the preview button it switches to a different font that does not support Unicode. Not a big deal for now, but it would be nice if this could be fixed so that everyone can type names properly.

Misha.

(2) Posted by Eric Huber [Saturday, Sep 9, 2006 02:09]

Regarding Misha's observation, I have noticed something: using Internet Explorer, if I select View...Encoding...Central European (Windows), I can see the " c' " character at the end of your name, Mladenović.
If the usual Western European (ISO) encoding is selected, instead of the " c' " I can read an "ae" (Latin Small Letter ae according to Character Map).
So, is it only a matter of encoding?
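
The effect described above can be reproduced with a minimal Python sketch. It assumes the last letter of the name is stored as the single byte 0xE6 (the value CP-1250 assigns to "ć"); the byte string is made up for illustration.

```python
# The same byte sequence, read under two different single-byte encodings.
raw = b"Mladenovi\xe6"  # assumed stored form; 0xE6 is the last letter

as_cp1250 = raw.decode("cp1250")      # Central European (Windows)
as_latin1 = raw.decode("iso-8859-1")  # Western European (ISO)

print(as_cp1250)  # the name ending in c with acute
print(as_latin1)  # the name ending in the "ae" ligature
```

So yes, in this sketch the bytes never change; only the table used to interpret them does.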

(3) Posted by David Knezevic [Saturday, Sep 9, 2006 02:11]

It is not (at least for me) as simple as it may seem. A web application is in many ways different from a program running locally. Data is transferred from client to server and back, and different applications take part in data entry and presentation on the client side and in data processing on the server side. Support for languages and Unicode requires considerable effort and time, which I am lacking now. Some day...

Nevertheless, there is some Unicode support in case you badly need language-specific characters. It is possible to enter Unicode character codes in HTML form: &#nn;, where nn stands for the decimal Unicode character code.

Example: Miodrag Mladenović - the last letter is entered as &#263;.

There is plenty of information on the Internet about HTML character codes, for instance:
http://en.wikipedia.org/wiki/List_of_HTML_decimal_character_references
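
For anyone who would rather compute the code than look it up, here is a minimal Python sketch. The helper name to_decimal_reference is hypothetical; html.unescape is from the standard library and does the reverse, roughly what a browser does when it renders the reference.

```python
import html


def to_decimal_reference(ch: str) -> str:
    """Return the HTML decimal character reference for a single character."""
    return f"&#{ord(ch)};"


ref = to_decimal_reference("ć")
print(ref)

# Turn the reference back into the character it stands for.
print(html.unescape("Mladenovi" + ref))
```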

It is possible to embed some HTML formatting codes, but more information about that will be available when I finish the help documentation.

(4) Posted by David Knezevic [Saturday, Sep 9, 2006 02:20]

And regarding Eric's post: I am not familiar enough with language-specific issues, so before I do anything about implementing it I need to read some documentation...

And one digression: while I was writing the answer to Misha's post, Eric's post arrived. Perhaps every answer should carry a link to the earlier post it refers to, and the referred post should contain a link to the answer. Otherwise the discussion may become somewhat confusing: although all posts are under the same topic, the flow of discussion may follow different paths. Opinions?

(5) Posted by Eric Huber [Saturday, Sep 9, 2006 03:28]

An answer to Milan's last digression:
You have the following (probably not exhaustive) choices:
- either, as you suggested, every answer carries a link to the earlier post it refers to, and the referred post contains a link to the answer;
- or, to solve that little problem without further programming, use a convention: if you want to answer Someone's question during the thread, you write '@Someone' at the beginning of your answer, so that everyone reading it knows which subject you're dealing with. Following that convention, I would have written @Milan at the beginning and that would have caught your attention.

(6) Posted by Iļja Ketris [Sunday, Sep 10, 2006 12:28]

The simplest solution is to switch the whole site to Unicode, specifically, UTF-8.
What we see now on page is:

<head>
<title>MatPlus.Net</title>
</head>

There is no <meta> tag in the head section, and there is no encoding information in the server reply:

[ilya@shiva ~]$ telnet www.matplus.net 80
Trying 194.106.188.116...
Connected to www.matplus.net.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Sun, 10 Sep 2006 09:16:59 GMT
Server: Apache/2.0.53 (Unix) mod_ssl/2.0.53 OpenSSL/0.9.7d DAV/2 PHP/4.3.11
Last-Modified: Sun, 06 Mar 2005 20:11:40 GMT
ETag: "168002-3b-9663f700"
Accept-Ranges: bytes
Content-Length: 59
Vary: Accept-Encoding,User-Agent
Connection: close
Content-Type: text/html

Prezentacija je u fazi azuriranja. Molimo probajte kasnije.Connection closed by foreign host.

That means that the pages will be interpreted in the default encoding, namely ISO-8859-1, also known as "Latin-1".
When we try to enter a character outside of this repertoire, including Milan's "ć" and my "ļ", it gets encoded by the forum software as an HTML entity, as was correctly noted.
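
The fallback can be sketched in Python as follows. The helper names declared_charset and effective_charset are hypothetical, and the ISO-8859-1 default is simply the assumption stated in this post; real browsers may apply their own heuristics.

```python
from email.message import Message

# Assumption taken from the post: with no declared charset, Latin-1 is used.
DEFAULT_CHARSET = "iso-8859-1"


def declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_charset(content_type: str) -> str:
    """Return the declared charset, or the assumed default if none is given."""
    return declared_charset(content_type) or DEFAULT_CHARSET


print(effective_charset("text/html"))                 # header as in the reply above
print(effective_charset("text/html; charset=UTF-8"))  # header with a declaration
```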

While this is acceptable, it may present problems with converting posts to text format, taking care of preview functionality etc. In a word, the authors/maintainers must be VERY aware of these encoding conversions taking place all the time. Besides, it is not very economical -- if I write in Cyrillic, every symbol takes as many as seven bytes: "&#1095;" (for "ч"), as opposed to two to three in UTF-8.
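
The byte counts are easy to check. A minimal Python sketch for the Cyrillic letter "ч" (variable names are illustrative only):

```python
ch = "ч"

# Size of the decimal character reference, which is pure ASCII.
reference = f"&#{ord(ch)};"
ref_bytes = len(reference.encode("ascii"))

# Size of the same character stored directly as UTF-8.
utf8_bytes = len(ch.encode("utf-8"))

print(reference, ref_bytes, utf8_bytes)
```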

The most efficient and universal way of dealing with that is switching the forum (and the whole site) to UTF-8 altogether. All it takes is adding a directive in the page headers, such as <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />, and the browsers will interpret everything in UTF-8, therefore most of our special characters will be valid HTML, requiring no recoding.

Another thing that I just found -- the forum interprets HTML, which makes it hard to enter characters like "<" and ">".

(7) Posted by Harry Fougiaxis [Sunday, Sep 10, 2006 14:01]

Not directly relevant to the topic, I know, but if someone would like to get a quite comprehensive e-book about Unicode, download "Unicode Explained" by Jukka Korpela, O'Reilly editions, June 2006. You need WinRAR to extract the archive.

http://rapidshare.de/files/32605705/oreilly_-_unicode_explained_jun_2006.rar

(8) Posted by David Knezevic [Sunday, Sep 10, 2006 16:34]

Thanks Ilja, advice from a person with your experience and knowledge is valuable. I admit that I still have a lot to learn, but I am good at it!

I added the utf-8 directive to the header of (these) pages - I am aware that this will not solve all the problems with Unicode. For instance, now we see a box instead of c acute, and for some reason the "<" following it is "eaten" by IE6 (Netscape 6 still shows a question mark and does not eat the tag opening) - see the caption of the first post in this thread.

(Quoting is yet to be implemented, so a description instead: about interpreting HTML tags)
I intentionally left HTML tags unchanged because, at the moment, that seems to me the best way to enable some text formatting (of course, one has to know how to use them). I still have to write (or find) a text formatting routine to replace the primitive one used currently. Text conversions are by no means a trivial problem. Even HTML entities must be handled carefully, i.e. converted before initializing the TEXTAREA for editing.
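
A minimal Python sketch of that last point, assuming the stored post text is dropped into the page as-is. The function name textarea_markup is hypothetical, and the forum itself runs on PHP, so this only illustrates the idea. Escaping "&", "<" and ">" first means the browser shows the original source text in the edit box instead of interpreting it.

```python
import html


def textarea_markup(stored_text: str) -> str:
    """Wrap stored post text in a TEXTAREA so it round-trips unchanged."""
    # quote=False: only &, < and > need escaping inside element content.
    escaped = html.escape(stored_text, quote=False)
    return f"<textarea>{escaped}</textarea>"


print(textarea_markup("x < y and &#263; stays as typed"))
```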

(9) Posted by David Knezevic [Sunday, Sep 10, 2006 17:04]

... but I just changed Miodrag Mladenovic's name in the registration database, so the 'eaten' > can be seen elsewhere, not in the caption

(10) Posted by Miodrag Mladenović [Sunday, Sep 10, 2006 17:11]

It looks like this is a good solution for this problem. Now I see all characters properly even in preview mode. For example I typed:

Миодраг Младеновић (my name in Cyrillic letters)
Miodrag Mladenović (my name in Latin letters)

and everything looks good.

(11) Posted by Administrator [Sunday, Sep 10, 2006 17:17]

... so I changed your name in the database back to Mladenović :)

Thanks to Ilja again - a bit of knowledge is worth more than weeks of hard work!

(12) Posted by Iļja Ketris [Monday, Sep 11, 2006 00:05]

To be properly displayed in the forum pages, which are now in UTF-8, "ć", and anything else, must be encoded in UTF-8, surprisingly :)

Miodrag's "ć" in the registration database was probably entered in some single-byte encoding, most likely CP-1250 (a modification of ISO-8859-2 by the infinite wisdom of Microsoft®), in which it has the single-byte value 0xE6.

UTF-8 uses a somewhat redundant mechanism that encodes characters as sequences of variable length, from one to four bytes (up to six in the original design), designed to be robust against losing bytes in the middle of a character, so that the system always knows where a character starts.
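
A minimal Python sketch of both properties. The sample characters are arbitrary and happen to take one to four bytes; the helper names are hypothetical. Continuation bytes always have the bit pattern 10xxxxxx, so any byte that does not match it marks the start of a character.

```python
samples = ["a", "ć", "€", "😀"]
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (byte & 0b1100_0000) == 0b1000_0000


def char_starts(data: bytes) -> list[int]:
    """Return the offsets at which a character begins in UTF-8 data."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]


encoded = "aćч".encode("utf-8")
print(lengths)
print(char_starts(encoded))
```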

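On its own, 0xE6 is not a valid character in UTF-8: it announces a three-byte sequence, and the bytes that follow it here are not continuation bytes. Some rendering mechanisms (as in IE) choose to ignore it altogether, which is probably also how the following "<" got eaten, while others (as in Mozilla-derived engines) are a little more informative, and present a question mark to flag a character outside of the declared repertoire.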
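
A small Python sketch of what a strict and a lenient decoder make of a stray 0xE6 byte. The byte string is made up for illustration, and Python's decoder is only a stand-in here; browsers apply their own recovery rules.

```python
raw = b"Mladenovi\xe6<b>"  # legacy single-byte text followed by a tag

# A strict decoder rejects the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD (the replacement character)
# for the offending byte and carries on with the rest.
lenient = raw.decode("utf-8", errors="replace")

print(strict_ok)
print(lenient)
```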

This is quite simple, really, and doesn't take reading a book first. The key is to use UTF-8 everywhere, so no conversion will ever be needed. In case you need to convert some legacy text in other encodings, there are plenty of tools for that. In Un*x there is the "iconv" command; in other operating systems your text processor will have an option to ask for the encoding of the text being imported, and all you have to do is guess right.
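
The same conversion can be sketched in a few lines of Python. The function name convert_legacy is hypothetical, and CP-1250 as the source encoding is the guess made above, not something confirmed for the actual database.

```python
def convert_legacy(data: bytes, source_encoding: str = "cp1250") -> bytes:
    """Re-encode legacy single-byte text as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


legacy = b"Mladenovi\xe6"  # assumed legacy bytes for the name
converted = convert_legacy(legacy)

print(converted)                  # the last letter is now two bytes
print(converted.decode("utf-8"))  # and decodes cleanly as UTF-8
```

Guessing the source encoding wrong does not raise an error for most single-byte encodings; it silently produces the wrong letters, so the result is worth checking by eye.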
