
Python Encoding Problems

python encoding http proxy

1 reply to this topic

#1 mitchfizz05

mitchfizz05

    CC Lurker

  • New Member
  • Pip
  • 8 posts

Posted 21 November 2014 - 02:59 PM

Hi! (Haven't posted here for ages..)

 

I've been working on a basic HTTP proxy thingy for the sake of experimenting with Python (Python 3, to be precise), and I've run into this roadblock...

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 291: invalid start byte

Sometimes the remote server returns some bytes that can't be decoded as UTF-8 or ISO-8859-1. Now it's got me stumped: how on Earth do I know the encoding of the HTTP response when I can't read the Content-Type header, since I can't decode the HTTP headers!  :confused:

 

(These undecodable bytes seem to appear in the content body)

 

Any help would be awesome.  :biggrin:



#2 Alexander

Alexander

    YOL9

  • Moderator
  • 3963 posts

Posted 21 November 2014 - 06:16 PM

Since most of the content served over the internet is compressible, especially text, servers often send it as a gzip stream. Binary formats like this usually begin with a "magic number", a fixed byte sequence at the start of the data (sometimes printable ASCII characters) that identifies what follows:

http://en.wikipedia.org/wiki/Gzip

 

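Wikipedia says the gzip magic number is 1f 8b. That would fit the 0x8b in your error: 0x1f is a valid single-byte character in UTF-8, so the decoder would only fail on the second byte.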
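
As a rough sketch, a check for the gzip magic number on a block of raw bytes could look like this. The function name looks_like_gzip is made up for illustration, and it only inspects the first two bytes, so treat it as a hint rather than proof that the data is a valid gzip stream:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def looks_like_gzip(data: bytes) -> bool:
    """Return True if the data starts with the gzip magic number."""
    return data[:2] == GZIP_MAGIC

# Compare a real gzip stream with plain text.
compressed = gzip.compress(b"<html>hello</html>")
print(looks_like_gzip(compressed))             # gzip output starts with 1f 8b
print(looks_like_gzip(b"<html>hello</html>"))  # plain text does not
```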

 

Any HTTP library should of course have the facilities to check the content encoding:

response.info().get('Content-Encoding') == 'gzip'

If that expression evaluates to True, you can decompress the body with Python's gzip library.
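
Since a proxy tends to work with raw bytes rather than a library response object, here is a minimal sketch of the same idea at that level. It assumes the whole response is already in memory and is not using chunked transfer encoding; the helper names (split_response, parse_headers, decode_body) are hypothetical. The point is to decode only the header block as text and leave the body as bytes until you know its Content-Encoding:

```python
import gzip

def split_response(raw: bytes):
    """Split a raw HTTP response into (header text, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    # ISO-8859-1 maps every byte to a character, so decoding the
    # header block this way cannot fail. The body stays as bytes.
    return head.decode("iso-8859-1"), body

def parse_headers(head: str) -> dict:
    """Map lower-cased header names to values, skipping the status line."""
    headers = {}
    for line in head.split("\r\n")[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers

def decode_body(headers: dict, body: bytes) -> bytes:
    """Undo gzip compression if the headers say it was applied."""
    if headers.get("content-encoding", "").lower() == "gzip":
        return gzip.decompress(body)
    return body

# Usage with a hand-built response, standing in for bytes read off a socket.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"Content-Encoding: gzip\r\n"
    b"\r\n"
) + gzip.compress("héllo".encode("utf-8"))

head, body = split_response(raw)
headers = parse_headers(head)
text = decode_body(headers, body).decode("utf-8")
print(text)
```

Only after decompressing does it make sense to decode the body as text, using the charset from Content-Type if there is one. A proxy that just forwards the response doesn't need to decode the body at all and can pass the bytes through untouched.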

 

If this is not the situation you are seeing, could you provide us with a file dump (or complete hex string) of the content you are getting?

 

Alexander.


All new problems require investigation, and so if errors are problems, try to learn as much as you can and report back.