

1.1.3 Handling String Encodings

The result returned by curl-easy-perform is either a string or a bytevector, depending on the value of its second parameter. The default is a string, which is what you receive if you call it in either of the following ways.

(curl-easy-perform handle)

or

(curl-easy-perform handle #f)

If you are using Guile 1.8, the string is always returned as an 8-bit string. Data in multi-byte encodings such as UTF-8 is unceremoniously reduced to 8-bit bytes. This is a limitation of Guile 1.8.

For Guile 2.0, the string is always decoded as ISO-8859-1, also known as Latin-1. This is because interpreting data of unknown encoding as Latin-1 never fails: every byte value maps to a character. The result may not be right, but the conversion won’t fail.

If you know that the data to be fetched should be interpreted as UTF-8, request the data as a bytevector and then convert the result to a string with utf8->string from the (rnrs bytevectors) module, like so.

(utf8->string (curl-easy-perform handle #t))
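To see why the choice of decoding matters, here is a minimal sketch that decodes the same bytes both ways. It does no network access: sample-body is a made-up stand-in for a fetched bytevector, and latin1-decode is a hypothetical helper that mimics the Latin-1 interpretation described above by mapping each byte to the character with the same code point.

```scheme
(use-modules (rnrs bytevectors))

;; Stand-in for a fetched body (an assumption for illustration):
;; the two bytes that encode "é" (U+00E9) in UTF-8.
(define sample-body (u8-list->bytevector '(#xC3 #xA9)))

;; Hypothetical helper: Latin-1 maps every byte to the character
;; with the same code point, so this decoding can never fail.
(define (latin1-decode bv)
  (list->string (map integer->char (bytevector->u8-list bv))))

;; Decoded as Latin-1, the two bytes become two characters.
(define as-latin1 (latin1-decode sample-body))

;; Decoded as UTF-8, the same two bytes become one character.
(define as-utf8 (utf8->string sample-body))
```

The Latin-1 result is two characters long and the UTF-8 result is one, which is the "may not be right, but won't fail" behavior in practice.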

But if you don’t know the encoding, things get more complicated.

The first and best place to look for an HTTP document’s encoding is the Content-Type field of the HTTP headers. The headers may contain a Content-Type line like the following.

Content-Type: text/html; charset=utf-8

In the above case, the encoding of the document is given as UTF-8. The previous example demonstrates how to unpack an HTTP header to find its encoding.
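As a rough sketch of the last step, the following pulls the charset parameter out of a Content-Type value that you have already extracted from the headers. The name content-type-charset is hypothetical and not part of guile-curl, and the regular expression is deliberately loose rather than a full parser for header parameters.

```scheme
(use-modules (ice-9 regex))

;; Case-insensitive match for charset=VALUE, with optional quotes.
(define charset-rx
  (make-regexp "charset *= *\"?([^;\" ]+)" regexp/icase))

;; Hypothetical helper: return the charset named in a Content-Type
;; value as a lowercase string, or #f if there is none.
(define (content-type-charset value)
  (let ((m (regexp-exec charset-rx value)))
    (and m (string-downcase (match:substring m 1)))))
```

Called on "text/html; charset=utf-8" this gives "utf-8"; called on a bare "text/html" it gives #f, which is the signal to fall back to the other sources described next.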

But not all HTTP-served documents have a charset parameter in the Content-Type header. If it isn’t there, the next most authoritative place to look for an encoding is the XML declaration, if this is an XHTML document. But XHTML is uncommon.

Another place to check is a meta element in the HTML document itself, which might look like the following.

<meta http-equiv="content-type" content="text/html; charset=utf-8" />
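One possible way to look for such an element is to scan the body as a string. Since the default Latin-1 string never fails to decode and the markup itself is usually plain ASCII, the string returned by (curl-easy-perform handle) is good enough for this search. The helper html-meta-charset below is hypothetical, and its regular expression is a loose approximation, not a real HTML parser.

```scheme
(use-modules (ice-9 regex))

;; Loose, case-insensitive match for charset=VALUE inside a <meta> tag.
(define meta-charset-rx
  (make-regexp "<meta[^>]*charset *= *[\"']?([^\"'/> ;]+)" regexp/icase))

;; Hypothetical helper: return the charset declared in the first
;; matching meta element as a lowercase string, or #f if none is found.
(define (html-meta-charset html)
  (let ((m (regexp-exec meta-charset-rx html)))
    (and m (string-downcase (match:substring m 1)))))
```

This accepts both the http-equiv form shown above and the shorter `<meta charset="utf-8">` form.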

Yet another possible source of encoding information is a byte order mark (BOM) at the start of the data, such as a UTF-16 BOM. It is a bit of a muddle.
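Checking for a BOM only needs the first few bytes of the bytevector. The sketch below uses hypothetical helpers starts-with? and bom-encoding; it recognizes the UTF-16 marks and, as an addition beyond the text above, the UTF-8 mark. It makes no attempt to tell UTF-32 apart from UTF-16.

```scheme
(use-modules (rnrs bytevectors))

;; #t when bytevector BV begins with the list of byte values BYTES.
(define (starts-with? bv bytes)
  (let loop ((i 0) (rest bytes))
    (cond ((null? rest) #t)
          ((>= i (bytevector-length bv)) #f)
          ((= (bytevector-u8-ref bv i) (car rest))
           (loop (+ i 1) (cdr rest)))
          (else #f))))

;; Hypothetical helper: name the encoding implied by a leading BOM,
;; or return #f when the data does not start with one.
(define (bom-encoding bv)
  (cond ((starts-with? bv '(#xEF #xBB #xBF)) "UTF-8")
        ((starts-with? bv '(#xFE #xFF)) "UTF-16BE")
        ((starts-with? bv '(#xFF #xFE)) "UTF-16LE")
        (else #f)))
```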

OK. Once you finally know the encoding of your HTML document, you can convert the body of the response into a string. If you’ve requested the HTML document as a bytevector, the bytevector->string procedure can be used to make the conversion.
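A minimal sketch of that conversion follows. It assumes the bytevector->string exported by Guile's (ice-9 iconv) module, which takes a bytevector and an encoding name; that module is not available in every Guile release. The wrapper decode-body is hypothetical, and falling back to ISO-8859-1 when no encoding was found is a design choice that mirrors the default string behavior described earlier.

```scheme
(use-modules (rnrs bytevectors) (ice-9 iconv))

;; Hypothetical helper: decode a fetched bytevector using ENCODING,
;; or as ISO-8859-1 when ENCODING is #f (no encoding was detected).
(define (decode-body bv encoding)
  (bytevector->string bv (or encoding "ISO-8859-1")))
```

With a real request, bv would be the result of (curl-easy-perform handle #t) and encoding whatever name you recovered from the headers, the document, or the BOM.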

