I have recently been upgrading bismarck, mainly migrating it from Python 2 to Python 3 and switching to a different crawling approach.

It got off to a bad start: printing the title of a crawled product raised an error:

  root@fb6e7c6fbe5c:/home/binss# python3 amazon_test.py
  Traceback (most recent call last):
    File "amazon_test.py", line 30, in <module>
      print(s)
  UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-7: ordinal not in range(128)

In the Python 2 days, UnicodeEncodeError was the error I dreaded most. I did not expect to run into it again in Python 3.

The first character of the title turned out to be '\u8266', so I tested the following:

  >>> print('\u8266')

Sure enough, it failed:

  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'ascii' codec can't encode character '\u8266' in position 0: ordinal not in range(128)
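
The failure does not depend on the terminal itself: any text stream whose encoding is ASCII rejects this character. Below is a minimal sketch that reproduces it with an in-memory stream; the helper name `try_write` is made up for illustration and is not part of bismarck.

```python
import io


def try_write(text, encoding):
    """Write text through a TextIOWrapper that uses the given encoding.

    Returns the encoded bytes on success, or the UnicodeEncodeError
    that was raised (hypothetical helper, for illustration only).
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return exc
    return raw.getvalue()


ascii_result = try_write('\u8266', 'ascii')
utf8_result = try_write('\u8266', 'utf-8')

# Only ASCII-safe values are printed, so this also runs on an ASCII stdout.
print(type(ascii_result).__name__)
print(utf8_result)
```

The ASCII stream fails in the same way as the print call above, while the UTF-8 stream produces the three bytes that encode the character.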

I tried all sorts of workarounds, but none of them solved the problem.

Finally I had a sudden idea: if print does not work, what about writing the string to a file?

  >>> s = '\u8266'
  >>> with open('xxx.txt', mode='w') as public_file:
  ...     public_file.write(s)

Still an error:

  Traceback (most recent call last):
    File "<stdin>", line 2, in <module>
  UnicodeEncodeError: 'ascii' codec can't encode character '\u8266' in position 0: ordinal not in range(128)

What about writing it in binary mode instead?

  >>> s = '\u8266'.encode('utf-8')
  >>> with open('xxx.txt', mode='wb') as public_file:
  ...     public_file.write(s)

To my surprise, this wrote the correct character, "艦". So could it be that the terminal's stdout does not support UTF-8 output?
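
As a side note on the file case: when `open()` is called in text mode without an `encoding` argument, it normally falls back to the locale's preferred encoding, which would explain why the text-mode write failed in this environment. A small sketch, assuming it is acceptable to name the encoding explicitly; the temporary directory is only there so the example does not touch real files.

```python
import os
import tempfile

text = '\u8266'

# A temporary directory keeps the sketch from touching real files.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'xxx.txt')

    # Text mode with an explicit encoding: no dependence on the locale.
    with open(path, mode='w', encoding='utf-8') as public_file:
        public_file.write(text)

    # Read the raw bytes back to see what was actually stored.
    with open(path, mode='rb') as public_file:
        stored = public_file.read()

# Printing the bytes repr is safe even on an ASCII-only stdout.
print(stored)
```

This stores the same bytes as the binary-mode version, without encoding the string by hand.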

So let's see what the current stdout actually is:

  root@fb6e7c6fbe5c:/home/binss# python3
  Python 3.5.1 (default, Dec 18 2015, 00:00:00)
  [GCC 4.8.4] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> import sys
  >>> sys.stdout
  <_io.TextIOWrapper name='<stdout>' mode='w' encoding='ANSI_X3.4-1968'>
  >>>

What is this ANSI_X3.4-1968 encoding, and why is it not utf-8? Googling that name finally turned up a relevant article:

http://lab.knightstyle.info/私がpython3でunicodeencodeerrorなのはどう考えてもデフォルト文字/
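
For reference, Python itself can show what this name means: in its codec registry, ANSI_X3.4-1968 is an alias for plain ASCII. The sketch below looks the name up and also prints the encodings the current process reports; the last two values depend on the environment, so no particular output is assumed for them.

```python
import codecs
import locale
import sys

# Resolve the name through Python's codec registry.
canonical = codecs.lookup('ANSI_X3.4-1968').name
print(canonical)

# Environment-dependent values: what the locale and stdout report here.
print(locale.getpreferredencoding(False))
print(getattr(sys.stdout, 'encoding', None))
```

An ASCII locale (for example an unset or "C" locale inside a container) is a common reason for stdout to end up with this encoding, although the post does not say how this machine was configured.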

The gist of the article is that, to output UTF-8, stdout has to be switched from ANSI_X3.4-1968 to utf-8 with the following code:

  import sys
  import io
  sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')

Then check again whether stdout is now utf-8:

  >>> sys.stdout
  <_io.TextIOWrapper name='<stdout>' encoding='utf-8'>

After that, print works happily:

  >>> print('\u8266')
  艦
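
The same wrapping can be tried without touching the real stdout. The sketch below uses an in-memory `io.BytesIO` as a stand-in for `sys.stdout.buffer`; the function name `utf8_text_stream` is hypothetical and only serves the illustration.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that str written to it is encoded as UTF-8.

    Hypothetical helper; in a real script the argument would be
    sys.stdout.buffer, as in the fix above.
    """
    return io.TextIOWrapper(binary_stream, encoding='utf-8',
                            line_buffering=True)


# Stand-in for sys.stdout.buffer, so the result can be inspected as bytes.
fake_stdout_buffer = io.BytesIO()
stream = utf8_text_stream(fake_stdout_buffer)

print('\u8266', file=stream)
stream.flush()

captured = fake_stdout_buffer.getvalue()
# Printing the bytes repr is safe even on an ASCII-only stdout.
print(captured)
```

Two alternatives that, as far as I know, achieve the same effect without rewrapping: setting the environment variable `PYTHONIOENCODING=utf-8` before starting the interpreter, or, on Python 3.7 and later, calling `sys.stdout.reconfigure(encoding='utf-8')`. The latter is not available on the Python 3.5.1 used here.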
 

via.https://www.binss.me/blog/solve-problem-of-python3-raise-unicodeencodeerror-when-print-utf8-string/