|
nk4um Moderator
Posts: 485
|
February 4, 2010 11:36 (Fixed)
Hi Miles/Menzo,
Menzo, I like your solution as an application-level workaround, but unfortunately it won't work in the infrastructure of the
HTTP bridge because I can't rely on any detectable preset value on which to base an encoding judgment.
So what I've done is put in a fix to the HTTP bridge which assumes that the POST parameters are UTF-8 encoded but allows
this to be overridden on a per-transport basis in the HTTP Bridge config, i.e.
<config> <defaultPostParamEncoding>ISO-8859-1</defaultPostParamEncoding> </config>
I believe UTF-8 is the most common and sensible default because by default NetKernel encodes HTML pages as UTF-8. GET parameters
are unaffected by this change.
This fix will be out in the next release.
Cheers for all your inputs, Tony
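For anyone wondering what "assume UTF-8 unless overridden" looks like in practice, here is a minimal sketch. It is not the HTTP bridge code: the class `PostParamDecoder` and its method are hypothetical names, and it assumes the raw form value is still percent-encoded when it reaches the decoder. The only library call is the standard `java.net.URLDecoder.decode(String, String)`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper, not part of NetKernel: decodes a percent-encoded POST value.
final class PostParamDecoder {
    private final String defaultEncoding;

    // configuredEncoding plays the role of defaultPostParamEncoding; null means "not set".
    PostParamDecoder(String configuredEncoding) {
        this.defaultEncoding = configuredEncoding != null ? configuredEncoding : "UTF-8";
    }

    // requestCharset is the charset from the Content-Type header, or null if absent.
    String decode(String rawValue, String requestCharset) {
        String charset = requestCharset != null ? requestCharset : defaultEncoding;
        try {
            return URLDecoder.decode(rawValue, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }
}
```

With no configuration, `%C3%BC` decodes to a single u-umlaut; with the default set to ISO-8859-1, a legacy client sending `%FC` gets the same character. A charset declared on the request wins over the configured default in this sketch, which is a design assumption rather than something stated above.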
|
|
nk4um User
Posts: 89
|
February 4, 2010 09:22
I think a problem with this approach is that the browsers aren't consistent (as far as I remember). So sometimes everything
goes fine and you end up with the correct UTF-8 string. If you then do this recode from ISO-8859-1 to UTF-8 by default, you'll
actually be mangling the string yourself. So you need a way to detect whether the string actually needs recoding. I don't think
we want to enter the world of browser sniffing, hence my solution of checking a known string. But maybe the recently open
sourced ProXML Encoding Validator could also help? I haven't had the time to play with it ...
Menzo
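One way to detect whether a value needs recoding, without a known check string, is to try a strict UTF-8 decode of the ISO-8859-1 bytes and keep the original when that fails. The sketch below is a different technique from the known-string check described above, and it is only a heuristic: a value that genuinely consists of characters such as U+00C3 followed by U+00BC would be converted by mistake. The class name `GuardedRecode` is made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: recode only when the value looks like misdecoded UTF-8.
final class GuardedRecode {
    static String recodeIfMisdecoded(String value) {
        // A character above U+00FF cannot come from an ISO-8859-1 decode, so leave the value alone.
        for (int i = 0; i < value.length(); i++) {
            if (value.charAt(i) > 0xFF) {
                return value;
            }
        }
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        // REPORT makes the decoder throw on invalid UTF-8 instead of substituting U+FFFD.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8 bytes: assume the string was decoded correctly already.
            return value;
        }
    }
}
```

Plain ASCII passes through unchanged either way, so the guard only matters for values containing characters in the U+0080 to U+00FF range.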
|
|
nk4um User
Posts: 24
|
February 3, 2010 23:44
I'm not sure why you put the test for 'POST' inside the loop, where it gets executed many times, rather than outside, e.g.:
if (request.getMethod().equals('POST')) { request.parameterValues.each { ..... } }
Problem is, either way, if I put this in, it will break if the bug ever gets fixed. Thanks, though.
-= miles =-
for (String value : mRequest.getParameterValues(name)) {
    if (mRequest.getMethod().equals("POST")) {
        byte[] b2 = value.getBytes("ISO-8859-1");
        value = new String(b2, "UTF-8");
    }
}
|
|
|
nk4um Moderator
Posts: 756
|
Looks like there has been a consistent inconsistency in what Jetty thinks is the default encoding...
http://jira.codehaus.org/browse/JETTY-1153?page=com.atlassian.streams.streams-jira-plugin%3Aactivity-stream-issue-tab
That thread relates to Jetty 6; I'd therefore assume that it holds for Jetty 5 (used with NK3).
Browsers are supposed to return the form with the same encoding as the page encoding, so if you serve UTF-8 you should receive
UTF-8, and this should be expressed in the Content-Type header. Jetty does its best and always uses the character encoding
when it's provided as an optional field on the Content-Type header. So if you're doing a POST from a non-browser client,
then you need a header like...
Content-Type: application/x-www-form-urlencoded; charset=utf-8
My vote is that UTF-8 is assumed as the default. It solves so many problems for apps to standardize on this.
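For a non-browser client written in Java, a minimal sketch of building a form body that matches that header could look like this. `FormPostSketch` and `encodeField` are illustrative names; the only library call is the standard `java.net.URLEncoder.encode(String, String)`. Actually sending the request is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical client-side helper: the body is percent-encoded as UTF-8 to match the declared charset.
final class FormPostSketch {
    // The header value from the post above; it must agree with the encoding used for the body.
    static final String CONTENT_TYPE = "application/x-www-form-urlencoded; charset=utf-8";

    static String encodeField(String name, String value) {
        try {
            return URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

A field named `word` holding a single u-umlaut becomes `word=%C3%BC`, i.e. the two UTF-8 bytes that the server then has to decode with the charset named in the header.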
|
|
nk4um Moderator
Posts: 485
|
February 3, 2010 21:37
OK, made progress. It seems that current Jetty versions treat the default encoding as ISO-8859-1, as Menzo suggested. HttpServletRequest.getCharacterEncoding()
returned null on the forms I tried, so I explicitly set accept-charset="UTF-8" on the HTML form element, but I still get a null
on this field and the same result. So I tried this hack, which fixes the problem.
for (String value : mRequest.getParameterValues(name)) {
    if (mRequest.getMethod().equals("POST")) {
        // recover the raw bytes Jetty decoded as ISO-8859-1, then decode them as UTF-8
        byte[] b2 = value.getBytes("ISO-8859-1");
        value = new String(b2, "UTF-8");
    }
}
|
I'm not sure how safe this is to do in general, but it appears to sort out the problem. It seems Jetty is constructing
the string from the byte stream with the wrong charset. I was testing this with Firefox 3.5.7. Anybody got any thoughts? Cheers, Tony
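On the "how safe" question, a small standalone sketch of the same round trip may help. It assumes nothing about Jetty; `RecodeSketch` is an illustrative name. ISO-8859-1 maps every byte to exactly one character in the range U+0000 to U+00FF, so the first step recovers the original bytes exactly when the string was produced by an ISO-8859-1 decode.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: the same two steps as the hack above, outside any servlet.
final class RecodeSketch {
    static String recode(String misdecoded) {
        // Step 1: undo the ISO-8859-1 decode to get the raw bytes back.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the charset the client actually used.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00C3 U+00BC is what a UTF-8 u-umlaut looks like after an ISO-8859-1 decode.
        System.out.println(recode("\u00C3\u00BC").equals("\u00FC"));
    }
}
```

The unsafe case is a string that was already decoded correctly: a lone u-umlaut becomes the single byte 0xFC, which is not valid UTF-8, so the result no longer equals the input.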
|
|
nk4um Moderator
Posts: 485
|
February 3, 2010 21:09
Cheers Menzo, we had assumed this encoding was all working because you had used all the unusual characters of the world! Little
did we know you'd implemented a workaround. ;-)
I've just been debugging and it appears that the problem is either inside Jetty or in the way we configure Jetty. For GETs, as
Miles says, you get the correctly encoded characters. But for POSTs they are getting mangled.
I'll look into this some more.
|
|
nk4um User
Posts: 89
|
February 3, 2010 09:40
Hi Miles, I've also encountered this problem in the past. Since 2004 I've been using an accessor which implements the following trick. In
all my forms I have a hidden input with a fixed value, e.g., 'Karo (Arrára)' (which happens to be the language name for
which I first found the problem, and which gets mangled by the forum ;-) ). All the NVP documents that I receive from a client
are run through this accessor:
<instr>
  <comment>Get and fix the parameters</comment>
  <type>sloot.fixEncoding</type>
  <operand>this:param</operand>
  <operator>
    <encoding>
      <source>ISO-8859-1</source>
      <destination>UTF-8</destination>
      <check>
        <xpath>/nvp/encoding-check</xpath>
        <value>Karo (Arrára)</value>
      </check>
    </encoding>
  </operator>
  <target>var:param</target>
</instr>
|
So the hidden variable is called encoding-check. If the encoding check fails, the accessor will assume the document was wrongly decoded as ISO-8859-1 when it should have been UTF-8. It fixes the encoding by serializing the document back to ISO-8859-1 and then reading it back in as UTF-8.
StringWriter sw = new StringWriter();
operand.serialize(sw,false);
String src = sw.toString();
//System.err.println("DBG: fix encoding ["+src_encoding+"] source["+src+"]");
// encode with the source charset, then decode the same bytes with the destination charset
String dst = new String(src.getBytes(src_encoding),dest_encoding);
//System.err.println("DBG: fix encoding ["+dest_encoding+"] destination["+dst+"]");
Document doc = (Document)XMLUtils.parse(new StringReader(dst));
if (check) {
    // inspect the result
    DOMXDA xda = new DOMXDA(doc,false);
    check_real_val = xda.getText(check_xp,true);
    if ((check_real_val!=null) && (check_val!=null) && check_real_val.trim().equals(check_val.trim())) {
        // accept the result
        result = doc;
    } else
        System.err.println("ERR: the encoding fix failed, the encoding check still fails ["+check_real_val+"!="+check_val+"]!");
} else {
    // just accept the result
    result = doc;
}
|
I should also have a test page around, and some links from my initial research. But this was all back in 2004 and before, so it
will be outdated. If you want, I can send you, or anybody else who is interested, the code of the accessor. Note that this is for NK 3. Hope this helps, Menzo
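For readers without NK 3 to hand, the known-string check can be sketched in plain Java over a map of form parameters. This is not the accessor's code: `SentinelEncodingFix`, its `fix` method and the use of a `Map` in place of an NVP document are all assumptions made for illustration, and it throws where the original only logs an error.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the accessor: recode all values only when the sentinel arrived mangled.
final class SentinelEncodingFix {
    static Map<String, String> fix(Map<String, String> params, String checkName, String expected) {
        if (expected.equals(params.get(checkName))) {
            return params; // sentinel intact: the values were decoded correctly
        }
        Map<String, String> recoded = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : params.entrySet()) {
            // assumes non-null values; ISO-8859-1 bytes are re-read as UTF-8
            byte[] raw = e.getValue().getBytes(StandardCharsets.ISO_8859_1);
            recoded.put(e.getKey(), new String(raw, StandardCharsets.UTF_8));
        }
        if (!expected.equals(recoded.get(checkName))) {
            // a sketch-level choice: fail loudly rather than return a doubtful result
            throw new IllegalStateException("encoding check still fails after recoding");
        }
        return recoded;
    }
}
```

Because the decision is based on a value whose correct form is known in advance, this avoids mangling submissions that arrived correctly, at the cost of needing the hidden field in every form.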
|
|
nk4um Moderator
Posts: 485
|
February 1, 2010 17:02
Hi Miles,
I'll take a look into this. What platform are you trying this on (NK3/NK4)?
Cheers, Tony
|
|
nk4um User
Posts: 101
|
I only posted this response to see if this forum system has the bug. It does.
|
|
nk4um User
Posts: 24
|
If I send in a word (via Firefox) with an accented character in it (U-umlaut, for example), it arrives correctly when I do
it via GET.
However, via POST, it arrives in the Java string as two characters, each holding one of the two UTF-8 bytes in its lower byte
and zero in its upper byte, rather than being decoded, as it should be, into the single character.
I suppose this ought to be posted as a bug, but I'm not sure how to do that.
Thanks, -= miles =-
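The symptom can be reproduced in a few lines of standalone Java, which may be useful for a bug report. This is only a sketch of the suspected cause (UTF-8 bytes decoded as ISO-8859-1); `MojibakeDemo` is an illustrative name and nothing here touches NetKernel or Jetty.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: shows what a UTF-8 u-umlaut looks like after an ISO-8859-1 decode.
final class MojibakeDemo {
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);  // 0xC3 0xBC for u-umlaut
        return new String(utf8, StandardCharsets.ISO_8859_1);     // one char per byte
    }

    public static void main(String[] args) {
        for (char c : misdecode("\u00FC").toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+00C3 then U+00BC
        }
    }
}
```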
|