QGIS and GDAL>=1.9 Encoding Issue: a Workaround

Posted: March 11, 2012 in GIS
Tags: , , , ,

After significant encoding handling changes in GDAL>=1.9 it became quite challenging to handle non-Latin attributes (Cyrillic in particular) stored in .dbf part of a shp-file. So the majority of the Cyrillic and other non-Latin users have to stick with GDAl 1.8.1 for now.

The cause of the issue is described here (Rus). Briefly: GDAL>=1.9 attemts to re-encode the .dbf attributes to UTF-8 on the basis of the LDID (Language Driver ID) written in .dbf header. But unfortunately LDID is usually missing, and in particular QGIS does not write it to the .dbf-file it creates. In case when LDID is missing, GDAL>=1.9 assumes that encoding of the .dbf-file is ISO8859_1 (Latin-1) which makes non-Latin characters unreadable.

The workaround I’m currently using is creating additional .cpg-file, that contains the ID of the encoding used. For example if encoding is Windows-1251, .cpg-file contains following record: “1251” (without quotes). When .cpg-file is present, GDAL>=1.9 + QGIS works just fine.

————————————————————————-

UPD: on some OS you will need to use ID from Additional ID column instead of Encoding ID column.

For Windows you may also try to use “unofficial” version of QGIS from here (with encoding issue solved). But it is possible that its installer is in Russian. 

UPD2: There is another workaround. You can open .dbf-file in Libre Office Calc (Open Office Calc) providing encoding needed and save it from there. This will write necessary header to .dbf-file and QGIS will open attributes correctly.  Note that this also will make fields names written in upper case.

UPD3: there is a plugin for encoding fixing available.

————————————————————————-

Here you are a table of the encoding IDs (taken from here):

Encoding ID Encodind name Additional ID Other names
1252 Western iso-8859-1
except when 128-159 is used, use “Windows-1252”
iso8859-1, iso_8859-1, iso-8859-1, ANSI_X3.4-1968, iso-ir-6, ANSI_X3.4-1986, ISO_646, irv:1991, ISO646-US, us, IBM367, cp367, csASCII, latin1, iso_8859-1:1987, iso-ir-100, ibm819, cp819, Windows-1252
20105 us-ascii us-acii, ascii
28592 Central European (ISO) iso-8859-2 iso8859-2, iso-8859-2, iso_8859-2, latin2, iso_8859-2:1987, iso-ir-101, l2, csISOLatin2
1250 Central European (Windows) Windows-1250 Windows-1250, x-cp1250
1251 Cyrillic (Windows) Windows-1251 Windows-1251, x-cp1251
1253 Greek (Windows) Windows-1253 Windows-1253
1254 Turkish (Windows) Windows-1254 Windows-1254
932 Japanese (Shift-JIS) shift_jis shift_jis, x-sjis, ms_Kanji, csShiftJIS, x-ms-cp932
51932 Japanese (EUC) x-euc-jp Extended_UNIX_Code_Packed_Format_for_Japanese, csEUCPkdFmtJapanese, x-euc-jp, x-euc
50220 Japanese (JIS) iso-2022-jp csISO2022JP, iso-2022-jp
1257 Baltic (Windows) Windows-1257 windows-1257
950 Traditional Chinese (BIG5) big5 big5, csbig5, x-x-big5
936 Simplified Chinese (GB2312) gb2312 GB_2312-80, iso-ir-58, chinese, csISO58GB231280, csGB2312, gb2312
20866 Cyrillic (KOI8-R) koi8-r csKOI8R, koi8-r
949 Korean (KSC5601) ks_c_5601 ks_c_5601, ks_c_5601-1987, korean, csKSC56011987
1255 (logical) Hebrew (ISO-logical) Windows-1255 iso-8859-8i
1255 (visual) Hebrew (ISO-Visual) iso-8859-8 ISO-8859-8 Visual, ISO-8859-8 , ISO_8859-8, visual
862 Hebrew (DOS) dos-862 dos-862
1256 Arabic (Windows) Windows-1256 Windows-1256
720 Arabic (DOS) dos-720 dos-720
874 Thai Windows-874 Windows-874
1258 Vietnamese Windows-1258 Windows-1258
65001 Unicode UTF-8 UTF-8 UTF-8, unicode-1-1-utf-8, unicode-2-0-utf-8
65000 Unicode UTF-7 UNICODE-1-1-UTF-7 utf-7, UNICODE-1-1-UTF-7, csUnicode11UTF7, utf-7
50225 Korean (ISO) ISO-2022-KR ISO-2022-KR, csISO2022KR
52936 Simplified Chinese (HZ) HZ-GB-2312 HZ-GB-2312
28594 Baltic (ISO) iso-8869-4 ISO_8859-4:1988, iso-ir-110, ISO_8859-4, ISO-8859-4, latin4, l4, csISOLatin4
28585 Cyrillic (ISO) iso_8859-5 ISO_8859-5:1988, iso-ir-144, ISO_8859-5, ISO-8859-5, cyrillic, csISOLatinCyrillic, csISOLatin5
28597 Greek (ISO) iso-8859-7 ISO_8859-7:1987, iso-ir-126, ISO_8859-7, ISO-8859-7, ELOT_928, ECMA-118, greek, greek8, csISOLatinGreek
28599 Turkish (ISO) iso-8859-9 ISO_8859-9:1989, iso-ir-148, ISO_8859-9, ISO-8859-9, latin5, l5, csISOLatin5
Advertisements
Comments
  1. Hi Rebelious,

    And where we put the .cpg-file?

    Thanks!

    Best regards

  2. No, it doesn’t work (windows 7 64 bit, QGIS 1.8). Putting a .cpg file with the same name and the code 1250 changes nothing. That makes me use QGIS 1.7.
    Thanks anyway!

    • Actually it is more GDAL than QGIS issue so you need get back not only to QGIS 1.7, but to GDAL 1.8.1 too. You may try to download “unofficial” version of QGIS (with encoding issue solved) for Windows from here [http://nextgis.ru/nextgis-qgis/], but it is possible that the installer is in Russian only.

  3. Great! Russian installer is not a problem :-).

  4. NextGIS works like a charm! Only one plugin had a problem (contour plugin 0.91).

  5. fegyi001 says:

    this worked for me, but not with the encoding ID, but with the third column. In the config file I just inserted UTF-8, and ever since everything works as desired! thanks!

  6. borys says:

    There is also a plugin for setting cpg files (and more) available: http://plugins.qgis.org/plugins/shapefile_encoding_fixer/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s