FAQ
6. Your burning questions answered!
Q: Is UTF-16 just a superset of UCS-2?
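A: In effect, yes. Every valid UCS-2 text is also valid UTF-16 and decodes to the same characters, so UTF-16 is backward compatible with UCS-2. The difference is that UTF-16 can also represent characters outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) by using surrogate pairs: two 16-bit code units taken from the reserved range U+D800 to U+DFFF. UCS-2 has no such mechanism and is limited to the BMP.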
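To make the surrogate-pair mechanism concrete, here is a minimal Python sketch of how a single character maps to UTF-16 code units. The function name to_utf16_code_units is made up for illustration; in real code you would normally just call str.encode("utf-16-le").

```python
def to_utf16_code_units(ch: str) -> list[int]:
    """Return the UTF-16 code units for one character (illustrative helper)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        # BMP character: a single 16-bit unit, identical to its UCS-2 form.
        return [cp]
    # Outside the BMP: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # high (lead) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
    return [high, low]


bmp_units = to_utf16_code_units("A")              # one unit
astral_units = to_utf16_code_units("\U0001F600")  # two units (a surrogate pair)
print([hex(u) for u in bmp_units], [hex(u) for u in astral_units])
```

A BMP character yields the same single unit under both encodings, while U+1F600 needs two units that UCS-2 has no way to interpret as one character.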
Q: Why does UTF-16 LE BOM use the same BOM as UCS-2 LE BOM?
A: The BOM is the character U+FEFF written at the start of the stream. Both encodings use 16-bit code units, so in little-endian order it is serialized as the same two bytes, FF FE. The BOM tells a reader the byte order only; it does not distinguish UCS-2 from UTF-16. That depends on how the code units after the BOM are interpreted: if surrogate pairs are present, the data is UTF-16; if not, it is valid as both UCS-2 and UTF-16.
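As a rough illustration, the Python sketch below writes a BOM in front of two little-endian payloads, one BMP-only and one containing a non-BMP character, and reads the byte order back. The helper names encode_with_bom and byte_order_from_bom are hypothetical; codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE are standard-library constants.

```python
import codecs


def encode_with_bom(text: str) -> bytes:
    """Encode as UTF-16 LE and prepend the little-endian BOM (FF FE)."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def byte_order_from_bom(data: bytes):
    """Return 'little', 'big', or None based only on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return None


bmp_only = encode_with_bom("hello")           # also valid UCS-2
with_astral = encode_with_bom("\U0001F600")   # needs a surrogate pair

# Both streams start with the same two bytes; the BOM alone says nothing
# about whether surrogate pairs follow.
print(bmp_only[:2].hex(), with_astral[:2].hex())
print(byte_order_from_bom(bmp_only), byte_order_from_bom(with_astral))
```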
Q: What happens if I try to open a UTF-16 encoded file with a UCS-2 decoder?
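A: If the file contains only BMP characters, the result is identical. If it contains characters outside the BMP, a UCS-2 decoder does not know that the two halves of a surrogate pair belong together and treats each half as a separate 16-bit value. Depending on the decoder, each half may be shown as a replacement character, passed through as an unpaired surrogate, or rejected, so the intended character is not produced.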
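Python has no built-in UCS-2 codec, so the sketch below simulates one under stated assumptions: decode_ucs2_le is a hypothetical decoder that reads each 16-bit unit as a standalone code point and substitutes U+FFFD for anything in the surrogate range. Real UCS-2 decoders may behave differently.

```python
import struct


def decode_ucs2_le(data: bytes) -> str:
    """Simulated UCS-2 LE decoder: one 16-bit unit equals one character."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must have an even number of bytes")
    units = struct.unpack(f"<{len(data) // 2}H", data)
    out = []
    for unit in units:
        if 0xD800 <= unit <= 0xDFFF:
            # Surrogates are not characters in UCS-2; this sketch replaces them.
            out.append("\ufffd")
        else:
            out.append(chr(unit))
    return "".join(out)


payload = "a\U0001F600b".encode("utf-16-le")  # 'a', a surrogate pair, 'b'
as_ucs2 = decode_ucs2_le(payload)             # the pair becomes two U+FFFD
as_utf16 = payload.decode("utf-16-le")        # the pair becomes one character
print(len(as_ucs2), len(as_utf16))
```

The UTF-16 decoder recovers three characters, while the simulated UCS-2 decoder produces four, with the emoji replaced by two placeholder characters.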
Q: Are there any performance differences between UCS-2 and UTF-16?
A: On modern systems the differences are generally negligible, and the overhead of handling surrogate pairs in UTF-16 is small. If you are working with a very large dataset and are certain that it contains only BMP characters, the fixed two-byte width of UCS-2 might offer a slight advantage, for example because the Nth character can be located without scanning. In most cases, though, the risk of encountering non-BMP characters outweighs any potential gain.
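If you do want to rely on fixed-width behavior, it is safer to verify the BMP-only assumption than to trust it. The following is a small sketch, with hypothetical helper names, that checks a string and compares its character count to its UTF-16 code unit count. It assumes the input is a well-formed Python str with no lone surrogates.

```python
def is_bmp_only(text: str) -> bool:
    """True if every character fits in a single 16-bit code unit."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def utf16_code_unit_count(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


samples = ["plain ASCII", "caf\u00e9", "smile \U0001F600"]
for s in samples:
    # When is_bmp_only is True, characters and code units match one to one,
    # which is the property a fixed-width (UCS-2 style) fast path depends on.
    print(repr(s), is_bmp_only(s), len(s), utf16_code_unit_count(s))
```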