How to handle special characters?

SaidAkif · ‎03-04-2015

Hi all,

I developed some ArcToolbox tools using Python code to process shapefiles. Some of the shapefile fields contain special characters such as é. Each time my script meets one of those characters in a field value, it crashes. What I have done until now is replace the character, for example é with e. Sometimes it is a lot of work to find and replace all the special characters. I looked on the web for solutions to this problem, but unfortunately I did not find anything that works without find/replace.

So, I am wondering if there is a solution for this problem?

Thanks

BerendVeldkamp · ‎03-04-2015

What kind of error do you get? Could you post a small snippet to demonstrate it?

One thing that will fail, for instance, is getting a row value and explicitly converting it to str. This can be solved by removing that conversion.

x = str(row[0]) # fails on non-ASCII values (UnicodeEncodeError)
x = row[0]      # correct: the value stays unicode
print(type(x))  # will print: <type 'unicode'>
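
For background on why the conversion breaks: in Python 2, str() on a unicode value implicitly encodes it with the ASCII codec, which cannot represent é. Below is a minimal sketch that reproduces the same failure outside ArcGIS. The sample values and the helper name to_ascii_or_none are made up for illustration; they are not part of arcpy.

```python
# Hypothetical stand-in for a row value read from a shapefile field.
value = u"Montr\u00e9al"  # "Montréal", written with an escape to stay ASCII-safe


def to_ascii_or_none(text):
    """Return text encoded as ASCII bytes, or None if it cannot be encoded."""
    try:
        # This is roughly what str(text) attempts implicitly in Python 2.
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


ascii_only = to_ascii_or_none(u"Montreal")  # plain ASCII encodes fine
with_accent = to_ascii_or_none(value)       # the accented value cannot be encoded
```

Keeping the value as unicode, as in the second line above, avoids the implicit encode altogether.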

If you have any special characters in the source file itself, you could add this line at the top to specify that it's using UTF-8:

# -*- coding: utf-8 -*-
x = u"éè"

BruceHarold · ‎03-04-2015

Berend is correct, we'll need to see your code to help. What is happening is that your shapefile uses a character encoding which you'll need to know in order to fix your processing. Alternatively, you can copy the data to a geodatabase so that it is in a known encoding (UTF-8). If there is a .cpg file with your shapefile, you can read it to get the encoding; otherwise you'll have to guess, which may get frustrating.
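
Reading the .cpg file could look something like the sketch below. It assumes the .cpg sits next to the .shp with the same base name and a lowercase extension, and the function name read_cpg_encoding is hypothetical.

```python
import io
import os


def read_cpg_encoding(shp_path):
    """Return the text stored in the shapefile's .cpg sidecar, or None."""
    # Assumption: the sidecar is named <basename>.cpg in the same folder.
    cpg_path = os.path.splitext(shp_path)[0] + ".cpg"
    if not os.path.isfile(cpg_path):
        return None
    # The file normally holds a single short code page name.
    with io.open(cpg_path, "r", encoding="ascii", errors="replace") as f:
        content = f.read().strip()
    return content or None
```

The returned string is whatever the file contains. It is not guaranteed to be a valid Python codec name, so it may need to be mapped by hand before it is passed to decode().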

SaidAkif · ‎03-04-2015

Thanks

Most of the time the special characters are related to the French language. I know the list of those special characters.

Now, if I do x = u'all special characters', how can I use x in my code?

Sorry, I am a beginner in Python.

Thanks

BruceHarold · ‎03-04-2015

Hi

You're going to need to use decode() to get a unicode object, but you'll need to know the encoding to supply the argument it needs.
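
A minimal sketch of that, assuming the raw bytes come from a dbf stored in Windows code page 1252 (an assumption, since the actual encoding of the data is not known here). The helper try_decode and its candidate list are hypothetical.

```python
# Raw bytes as they might be stored in a cp1252 dbf (assumed sample data).
raw = b"Qu\xe9bec"

# With the right encoding, decode() yields a unicode object.
text = raw.decode("cp1252")


def try_decode(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_name)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    return None, None


decoded, used = try_decode(raw)
```

UTF-8 is tried first because it is strict and rejects most byte sequences that are not really UTF-8, whereas a single-byte code page such as cp1252 accepts almost anything.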

BerendVeldkamp · ‎03-04-2015

@Bruce: That actually depends; the encoding may even be Unicode (UTF-8). See 21106 - Read and write shapefile and dBASE files encoded in various code pages. I think that if the dbf is in a specific code page (not Unicode), Python wouldn't break; it would merely display wrong characters.
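
To illustrate the "wrong characters" case, here is a small sketch in plain Python (no arcpy involved). It assumes text that was written as UTF-8 and then read with code page 1252.

```python
original = u"\u00e9"                    # the character é
utf8_bytes = original.encode("utf-8")   # two bytes: 0xC3 0xA9

# Reading those bytes with the wrong code page raises no exception,
# but each byte becomes its own (wrong) character.
misread = utf8_bytes.decode("cp1252")

# Reversing the wrong step recovers the original text.
repaired = misread.encode("cp1252").decode("utf-8")
```

Here misread is the two-character string Ã©, the typical sign of UTF-8 data being read through a single-byte code page.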

That said, without seeing the actual code and the data, it is all rather speculative.

SaidAkif · ‎03-04-2015

Thanks for your input

I just want to point out that I faced this problem when I used SelectByAttributes. I have to keep the content of my fields as it is, without any changes. I believe I used decode(), but it did not work. Perhaps I used it in the wrong way.