The result returned by curl-easy-perform will be either a string or a bytevector, depending on the value of the second parameter. The default is to return a string, which you will receive if you call it this way.
(curl-easy-perform handle #f)
The string will always be decoded as ISO-8859-1, also known as Latin-1. This is because decoding data in some unknown format as Latin-1 will never fail: every possible byte maps to exactly one character. The result may not be right, but the conversion won’t fail.
If you know that the data to be fetched is to be interpreted as a UTF-8 string, you should request the data as a bytevector and then convert the result to a string with utf8->string from the (rnrs bytevectors) module, like so.
(utf8->string (curl-easy-perform handle #t))
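If you have already fetched the data as a string in the default mode and only later learn that it was UTF-8, the original bytes can be recovered, because Latin-1 maps each byte to exactly one character. The following is a minimal sketch, assuming Guile's (ice-9 iconv) module is available. The procedure name latin1-string->utf8-string is hypothetical and is not part of this library.

```scheme
(use-modules (ice-9 iconv)        ; string->bytevector
             (rnrs bytevectors))  ; utf8->string

;; Hypothetical helper: turn a Latin-1 decoded string back into its
;; original bytes, then decode those bytes as UTF-8.
(define (latin1-string->utf8-string str)
  (utf8->string (string->bytevector str "ISO-8859-1")))
```

Requesting a bytevector in the first place avoids this round trip and is the simpler choice when the encoding is known up front.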
But if you don’t know the encoding, things get more complicated.
The first and best place to look for an HTTP document’s encoding is the Content-Type field of the HTTP headers. The headers may contain a Content-Type line like the following.
Content-Type: text/html; charset=utf-8
In the above case, the encoding of the document is listed as UTF-8. The previous example demonstrates how to unpack an HTTP header to find its encoding.
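Once you have the value of the Content-Type field as a string, pulling out the charset parameter is a matter of string searching. This is only a sketch: content-type-charset is a hypothetical helper, not a procedure of this library, and it does not handle every form the header grammar allows.

```scheme
(use-modules (srfi srfi-13))  ; string-contains-ci, string-index, string-trim-both

;; Hypothetical helper: given a Content-Type value such as
;; "text/html; charset=utf-8", return the charset name, or #f if the
;; value has no charset parameter.
(define (content-type-charset value)
  (let ((start (string-contains-ci value "charset=")))
    (and start
         (let* ((from (+ start (string-length "charset=")))
                ;; The parameter runs to the next ";" or the end of the value.
                (to (or (string-index value #\; from)
                        (string-length value)))
                (raw (string-trim-both (substring value from to))))
           ;; Drop optional surrounding double quotes.
           (string-trim-both raw #\")))))
```

For example, (content-type-charset "text/html; charset=utf-8") returns "utf-8", and (content-type-charset "text/html") returns #f.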
But not all HTTP-served documents have a charset parameter in the Content-Type header. If it isn’t there, the next most authoritative place to look for an encoding is the XML declaration, if this is an XHTML document. But XHTML is uncommon.
Another place to check is a meta element in the HTML document itself. It might look like the following.
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
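Because the default Latin-1 string never fails to decode, and most encodings in use on the web agree with ASCII for markup characters, you can usually search that string for a meta element before you know the real encoding. The sketch below is an assumption-laden scan, not an HTML parser. The name html-meta-charset is hypothetical, and the scan will be fooled by comments, scripts, or any tag whose name merely begins with "meta".

```scheme
(use-modules (srfi srfi-13))  ; string-contains-ci, string-index, string-trim

;; Hypothetical helper: return the charset named in the first <meta ...>
;; tag that mentions one, or #f if there is none.
(define (html-meta-charset html)
  (define (delimiter? c) (memv c '(#\" #\' #\space #\; #\/)))
  (let loop ((pos 0))
    (let ((open (string-contains-ci html "<meta" pos)))
      (and open
           (let* ((close (or (string-index html #\> open)
                             (string-length html)))
                  ;; Only look for "charset=" inside this one tag.
                  (cs (string-contains-ci html "charset=" open close)))
             (if cs
                 (let* ((from (+ cs (string-length "charset=")))
                        ;; Skip an opening quote or stray spaces.
                        (name (string-trim (substring html from close)
                                           delimiter?))
                        (end (or (string-index name delimiter?)
                                 (string-length name))))
                   (substring name 0 end))
                 (loop close)))))))
```

Applied to the meta element shown above, this returns "utf-8".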
Yet another possible source of encoding information is a UTF-16 BOM (byte order mark) at the very start of the data. It is a bit of a muddle.
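Checking for a BOM means looking at the first few bytes of the data, so it is easiest on a bytevector. The following sketch uses a hypothetical helper named bom-encoding. It recognises the two UTF-16 marks and, as an addition not discussed above, the UTF-8 mark. It ignores UTF-32, whose little-endian mark begins with the same two bytes as UTF-16LE.

```scheme
(use-modules (rnrs bytevectors))  ; bytevector-length, bytevector-u8-ref

;; Hypothetical helper: return an encoding name if BV begins with a
;; recognised byte order mark, otherwise #f.
(define (bom-encoding bv)
  (define (starts-with? . bytes)
    (and (>= (bytevector-length bv) (length bytes))
         (let loop ((i 0) (rest bytes))
           (or (null? rest)
               (and (= (bytevector-u8-ref bv i) (car rest))
                    (loop (+ i 1) (cdr rest)))))))
  (cond ((starts-with? #xEF #xBB #xBF) "UTF-8")
        ((starts-with? #xFE #xFF) "UTF-16BE")
        ((starts-with? #xFF #xFE) "UTF-16LE")
        (else #f)))
```

You would call it as (bom-encoding (curl-easy-perform handle #t)), keeping the bytevector around so it can be decoded afterwards.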
OK. Once you finally know what the encoding of your HTML document is, you can convert the body of the response into a string. If you’ve requested the HTML document as a bytevector, the bytevector->string procedure from Guile’s (ice-9 iconv) module can be used to make the conversion.
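The final conversion might look like the sketch below. The name decode-body is hypothetical. Its body argument stands for the bytevector returned by (curl-easy-perform handle #t), and charset is whatever encoding name you discovered.

```scheme
(use-modules (ice-9 iconv))  ; bytevector->string

;; Hypothetical helper: decode BODY, a bytevector, using the encoding
;; named by CHARSET.
(define (decode-body body charset)
  ;; The 'substitute strategy replaces undecodable bytes instead of
  ;; raising an error; use 'error if you would rather be told.
  (bytevector->string body charset 'substitute))
```

Choosing 'substitute keeps a mislabelled document readable at the cost of silently losing the bad bytes, which seems a reasonable trade for documents fetched from the web.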