2004-08-06 (Friday) 15:32
Hi all,
RFC 2060 specifies how international (including Chinese) mailbox names are encoded. The relevant section is quoted below:
5.1.3. Mailbox International Naming Convention
By convention, international mailbox names are specified using a
modified version of the UTF-7 encoding described in [UTF-7]. The
purpose of these modifications is to correct the following problems
with UTF-7:
1) UTF-7 uses the "+" character for shifting; this conflicts with
the common use of "+" in mailbox names, in particular USENET
newsgroup names.
2) UTF-7’s encoding is BASE64 which uses the "/" character; this
conflicts with the use of "/" as a popular hierarchy delimiter.
3) UTF-7 prohibits the unencoded usage of "\"; this conflicts with
the use of "\" as a popular hierarchy delimiter.
4) UTF-7 prohibits the unencoded usage of "~"; this conflicts with
the use of "~" in some servers as a home directory indicator.
5) UTF-7 permits multiple alternate forms to represent the same
string; in particular, printable US-ASCII characters can be
represented in encoded form.
In modified UTF-7, printable US-ASCII characters except for "&"
represent themselves; that is, characters with octet values 0x20-0x25
and 0x27-0x7e. The character "&" (0x26) is represented by the
two-octet sequence "&-".
All other characters (octet values 0x00-0x1f, 0x7f-0xff, and all
Unicode 16-bit octets) are represented in modified BASE64, with a
further modification from [UTF-7] that "," is used instead of "/".
Modified BASE64 MUST NOT be used to represent any printing US-ASCII
character which can represent itself.
"&" is used to shift to modified BASE64 and "-" to shift back to
US-ASCII. All names start in US-ASCII, and MUST end in US-ASCII (that
is, a name that ends with a Unicode 16-bit octet MUST end with a
"-").
For example, here is a mailbox name which mixes English, Japanese,
and Chinese text: ~peter/mail/&ZeVnLIqe-/&U,BTFw-
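I have read this over and over and still cannot work out how to decode and encode Chinese mailbox names in Python. For example, under the rules above:
"草稿箱" (Drafts) encodes to "&g0l6P3ux-", and "发件箱" (Outbox) encodes to "&U9FO9nux-".
How can this encoding and decoding be implemented? Could someone post an example?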
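One possible way to follow the rules quoted from RFC 2060 section 5.1.3, as a minimal sketch only. It is written for present-day Python 3 rather than the Python 2.3 used elsewhere in this thread, and the names `encode_mailbox` and `decode_mailbox` are made up for illustration; they are not part of `imaplib` or any other library. Runs of characters outside printable US-ASCII are converted to UTF-16 (big-endian), base64-encoded without padding, and "/" is swapped for ",".

```python
import base64


def _encode_run(run: str) -> str:
    """Encode a run of non-ASCII characters as '&' + modified BASE64 + '-'."""
    b64 = base64.b64encode(run.encode("utf-16-be")).decode("ascii")
    # Modified BASE64: no '=' padding, and ',' instead of '/'.
    return "&" + b64.rstrip("=").replace("/", ",") + "-"


def _decode_run(chunk: str) -> str:
    """Decode the modified BASE64 text found between '&' and '-'."""
    b64 = chunk.replace(",", "/")
    b64 += "=" * (-len(b64) % 4)  # restore the padding that was stripped
    return base64.b64decode(b64).decode("utf-16-be")


def encode_mailbox(name: str) -> str:
    """Hypothetical helper: Unicode mailbox name -> modified UTF-7 text."""
    out, run = [], []
    for ch in name:
        if 0x20 <= ord(ch) <= 0x7E:      # printable US-ASCII
            if run:                      # close any pending encoded run
                out.append(_encode_run("".join(run)))
                run = []
            out.append("&-" if ch == "&" else ch)
        else:
            run.append(ch)
    if run:                              # name ends inside an encoded run
        out.append(_encode_run("".join(run)))
    return "".join(out)


def decode_mailbox(encoded: str) -> str:
    """Hypothetical helper: modified UTF-7 text -> Unicode mailbox name."""
    out = []
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch != "&":
            out.append(ch)
            i += 1
            continue
        end = encoded.index("-", i)      # ValueError if the shift is unterminated
        chunk = encoded[i + 1:end]
        out.append("&" if chunk == "" else _decode_run(chunk))
        i = end + 1
    return "".join(out)


print(encode_mailbox("草稿箱"))   # &g0l6P3ux-
print(encode_mailbox("发件箱"))   # &U9FO9nux-
print(decode_mailbox("~peter/mail/&ZeVnLIqe-/&U,BTFw-"))
```

The sketch does no validation beyond what `base64` and the UTF-16 codec do themselves, so a real client would probably want stricter checks on malformed input.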
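One last question: Python is nice, but handling Chinese text is a real headache!
According to some material I found, UTF-8 encoding and decoding can be done like this:
s=u"社会主义中国"
u8=s.encode("utf-8")   # convert to UTF-8
# after the conversion I get "脡莽禄谩脰梅脪氓脰脨鹿煤", whereas another application converting from gb2312 produces "绀句細涓讳箟涓浗"
u8.decode("utf-8")     # convert back to unicode
If I read the "绀句細涓讳箟涓浗" (UTF-8) produced by another system, decoding it with the method above fails:)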
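A small sketch of what seems to be going on with the two garbled strings, written in present-day Python 3 as an assumption (the thread itself uses Python 2.3). The second string is what UTF-8 bytes look like when they are displayed as GBK. The first one looks like a double conversion: my guess is that the GBK bytes were first taken to be Latin-1, then encoded to UTF-8, then displayed as GBK.

```python
text = "社会主义中国"

utf8_bytes = text.encode("utf-8")   # 3 bytes per character for this text
gbk_bytes = text.encode("gbk")      # 2 bytes per character for this text

# Decoding with the codec that produced the bytes gives the text back.
round_trip = utf8_bytes.decode("utf-8")

# UTF-8 bytes interpreted as GBK, which is what a GBK terminal displays.
# errors="replace" keeps the demo from raising on byte pairs GBK cannot map.
as_seen_on_gbk_terminal = utf8_bytes.decode("gbk", errors="replace")

# Guessed double conversion: GBK bytes mistaken for Latin-1, encoded to
# UTF-8, then displayed as GBK.
double_converted = (
    gbk_bytes.decode("latin-1").encode("utf-8").decode("gbk", errors="replace")
)

print(len(text), len(utf8_bytes), len(gbk_bytes))   # 6 18 12
print(round_trip == text)                            # True
print(as_seen_on_gbk_terminal)
print(double_converted)
```

In both cases the bytes themselves are fine. What goes wrong is the codec used to interpret them, so the fix is to decode with the encoding the data was actually written in.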
Sincerely,
Frank Ning
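2004-08-06 (Friday) 15:46
#!/usr/bin/env python
# printu.py
import locale
# Codec name of the current locale (the encoding the terminal is assumed to use)
encoding = locale.getdefaultlocale()[1]
P1="""社会主义中国"""          # byte string in the locale's encoding (Python 2)
s1 = unicode(P1, encoding)    # decode the bytes into a unicode object
#s2 = unicode(P1, "utf-8")
print s1
print s1.encode("utf-8")      # UTF-8 bytes; garbled on a non-UTF-8 terminal
print len(s1)                 # number of characters, not bytes
# python printu.py
Output:
社会主义中国
绀句細涓讳箟涓浗
6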
2004-08-06 (Friday) 15:50
Thank you very much!
2004-08-06 (Friday) 16:07
That does not seem to work for me:)
>>> import locale
>>> encoding = locale.getdefaultlocale()[1]
>>> P1="""社会主义中国"""
>>> s1 = unicode(P1, encoding)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
LookupError: unknown encoding: gb18030
>>> s1 = unicode(P1, "utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
One more question: if I read "绀句細涓讳箟涓浗" from a text file, how do I turn it back into "社会主义中国"?
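For the file case, a minimal sketch in present-day Python 3 (an assumption; the session above is Python 2.3). It assumes the file really holds UTF-8 bytes and only looks garbled because it is viewed as GBK. The temporary file stands in for the file written by the other system.

```python
import os
import tempfile

text = "社会主义中国"

# Stand-in for "another system" writing the text as UTF-8 bytes.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 1: read the raw bytes and decode them explicitly.
with open(path, "rb") as f:
    raw = f.read()
decoded = raw.decode("utf-8")

# Option 2: let open() do the decoding by naming the file's encoding.
with open(path, encoding="utf-8") as f:
    decoded_again = f.read()

os.remove(path)

# Re-encode for a consumer that expects GBK, such as a GBK terminal.
gbk_bytes = decoded.encode("gbk")

print(decoded == text, decoded_again == text)   # True True
print(len(raw), len(gbk_bytes))                 # 18 12
```

Reading in binary mode and decoding by hand makes the assumed encoding explicit, which helps when the file's origin is uncertain.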
2004-08-06 (Friday) 16:24
What platform are you running this on? What is your locale?
Or you can try http://cjkpython.i18n.org/
2004-08-06 (Friday) 17:37
UTF-8 encoding and decoding works now, thanks a lot:)
I downloaded the CJKCodecs package from http://cjkpython.i18n.org/, then compiled and installed it.
>>> import locale
>>> encoding=locale.getdefaultlocale()[1]
>>> P1="社会主义中国"
>>> s1=unicode(P1,encoding)
>>> s1
u'\u793e\u4f1a\u4e3b\u4e49\u4e2d\u56fd'
>>> s=s1.encode("utf-8")
>>> print s
绀句細涓讳箟涓浗
>>> l="绀句細涓讳箟涓浗"
>>> p=l.decode("utf-8")
>>> p
u'\u793e\u4f1a\u4e3b\u4e49\u4e2d\u56fd'
>>> p.encode(encoding)
'\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa'
>>> print p.encode(encoding)
社会主义中国
>>> P1
'\xc9\xe7\xbb\xe1\xd6\xf7\xd2\xe5\xd6\xd0\xb9\xfa'