I have a problem which I think is charset-related. I have written a little app which does the following:
1. It downloads a web page using wget and saves it as X
2. When I access a web page, a PHP script looks for the file X (using scandir() and looping over the entries for a match), then reads and outputs its contents
Simple enough, and it works fine for "normal" characters. But whenever special characters such as åäö are involved, the script decides that the file X (e.g. "läte") does not exist. After a lot of trial and error I can conclude that:
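Stripped down, the lookup looks roughly like this (the directory and file names are placeholders; to keep it self-contained the sketch creates its own test directory instead of pointing at wget's output):

```php
<?php
// Minimal sketch of the lookup described above. In the real app, $dir
// would be wherever wget saves the downloaded pages.
$dir = sys_get_temp_dir() . '/pagecache';
@mkdir($dir);
file_put_contents("$dir/läte", 'cached page body');

$wanted = 'läte';             // the file we expect wget to have saved
$found  = null;
foreach (scandir($dir) as $entry) {
    if ($entry === $wanted) { // this is the comparison that fails for åäö
        $found = file_get_contents("$dir/$entry");
        break;
    }
}
var_dump($found); // NULL would reproduce the bug
```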
1. The script states that the two strings are not equal (tested with ===, ==, strcmp() and strcasecmp()) even though they print identically (i.e. the script claims that "läte" does not equal "läte").
2. Using mb_detect_encoding(), PHP claims that both strings are encoded as UTF-8. (If the filename does not contain åäö, it reports ASCII.)
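Point 2 can be reproduced without the file system at all. My guess (based on point 4 below) is that the file-system string comes back in Unicode decomposed form; both spellings are valid UTF-8 byte sequences, so encoding detection cannot tell them apart:

```php
<?php
// Both spellings of "läte" are valid UTF-8, so mb_detect_encoding()
// reports UTF-8 for each; a plain-ASCII name is reported as ASCII.
$nfc = "l\u{00E4}te";   // 'ä' as the single code point U+00E4
$nfd = "la\u{0308}te";  // 'a' followed by U+0308 COMBINING DIAERESIS
echo mb_detect_encoding($nfc, ['ASCII', 'UTF-8'], true), "\n";   // UTF-8
echo mb_detect_encoding($nfd, ['ASCII', 'UTF-8'], true), "\n";   // UTF-8
echo mb_detect_encoding('late', ['ASCII', 'UTF-8'], true), "\n"; // ASCII
```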
3. If I run the strings through htmlentities(), however, the output differs:
String from PHP: lÃ¤te
String from file system: laÌˆte
4. The length of the strings is also different; the string from the file system is consistently one character longer.
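Dumping the raw bytes of the two candidate spellings shows a difference that would explain points 3 and 4: 'ä' can be stored precomposed (U+00E4, two bytes) or decomposed as 'a' + U+0308 (three bytes), which is also one character longer under mb_strlen(). (That my file system actually returns the decomposed form is an assumption based on the symptoms, not something I have confirmed.)

```php
<?php
// Compare the byte sequences of precomposed vs decomposed "läte".
$nfc = "l\u{00E4}te";   // precomposed 'ä'
$nfd = "la\u{0308}te";  // 'a' + combining diaeresis
echo bin2hex($nfc), "\n";            // 6cc3a47465
echo bin2hex($nfd), "\n";            // 6c61cc887465
echo mb_strlen($nfc, 'UTF-8'), "\n"; // 4
echo mb_strlen($nfd, 'UTF-8'), "\n"; // 5 -- one character longer
```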
I have tested this on both Mac OS and Ubuntu with the same result. Both systems are set up for UTF-8, confirmed both by running "locale" and by the headers sent by Apache.
Any ideas what's going on? How can I match a multibyte string from the file system against the same multibyte string specified in PHP?
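If the culprit really is a Unicode normalization difference, I imagine something along these lines (using the intl extension's Normalizer class, which I have not actually tried in the app yet) ought to make the comparison succeed:

```php
<?php
// Untested idea: normalize both strings to the same Unicode form (NFC)
// before comparing. Requires the intl extension.
$fromPhp = "l\u{00E4}te";   // precomposed, as typed in the PHP source
$fromFs  = "la\u{0308}te";  // decomposed, as the file system may return it

$same = Normalizer::normalize($fromPhp, Normalizer::FORM_C)
     === Normalizer::normalize($fromFs, Normalizer::FORM_C);
var_dump($same); // bool(true)
```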