IBX Strings and Character Set Handling

A Firebird database can use a wide range of character sets for its character and text mode blob columns. A client application can either read each column in its native character set or have the Firebird client library transliterate on its behalf. For example, adding:

lc_ctype=UTF8

to a TIBDatabase's Params property causes all character data (other than columns with character set 'none' or 'octetstring') to be transliterated into UTF8, regardless of the column's defined character set.
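
The same setting can also be applied in code before the database is connected. The following is a minimal sketch rather than a definitive recipe: the unit name CharSetSetup and the procedure name UseUTF8Connection are invented for illustration, and it assumes the IBX package (unit IBDatabase) is available to the project.

```pascal
unit CharSetSetup;

{$mode objfpc}{$H+}

interface

uses
  IBDatabase;

// Hypothetical helper (not part of IBX): requests UTF8 transliteration.
// Call it before the database connection is opened.
procedure UseUTF8Connection(ADatabase: TIBDatabase);

implementation

procedure UseUTF8Connection(ADatabase: TIBDatabase);
begin
  // Params is a TStrings of name=value pairs; assigning through Values
  // replaces an existing lc_ctype entry instead of adding a duplicate.
  ADatabase.Params.Values['lc_ctype'] := 'UTF8';
end;

end.
```

Assigning through Values rather than calling Params.Add avoids ending up with two lc_ctype entries when one was already set at design time.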

For most applications, this is probably the easiest way to handle multiple character sets. However, some more specialist applications may need access to the data in its native character set encoding.

From release 1.4.0, IBX provides the encoded character set width and character set name of each character mode column. It also uses this to compute the correct byte size for the field buffer without having to oversize the field's display width.

FPC 3.0.0 onwards. FPC 3.0.0 introduced AnsiStrings that carry their code page as a property of the string (see http://wiki.freepascal.org/FPC_Unicode_support#DefaultSystemCodePage). From IBX release 1.4.1 onwards:

•TIBStringField.AsString and TIBMemoField.AsString will always transliterate characters to UTF8, regardless of the lc_ctype setting, unless the field's character set is “NONE” or “OCTETSTRING”. This is because the LCL can only correctly handle UTF8.
•Assigning to TIBStringField.AsString and TIBMemoField.AsString will now result in transliteration to the code page specified for the Firebird driver if the assigned string has a different code page.
•TIBSQL Fields and Params “AsString” property will always return a string in the character set used by Firebird for the field. When assigned to, the string will be transliterated to the Firebird Field's character set if necessary.

There is also a new TIBDatabase property UseDefaultSystemCodePage. When this is set to true, any hardcoded lc_ctype is ignored and IBX will choose the lc_ctype that best matches the DefaultSystemCodePage.

Unless you have specialist string handling requirements, an lc_ctype of UTF8 is probably the best choice for Lazarus.

The Theory

The TDataSet model demands that a dataset provides information about its column types in a TFieldDefs collection, comprising a TFieldDef for each column identifying its type, column name and other useful information. This is used, for example, by the Lazarus IDE's Fields Editor to create a list of dataset fields. Selecting a field name in the Object Inspector also uses the field defs.

IBX subclasses the TFieldDef to create its own extended field def (TIBFieldDef). This contains the Character Set Name and Character Set Size for string and memo type fields, where the Character Set Size is a number in the range 1..4 that gives the maximum number of bytes in which a character can be encoded. For example, UTF8 has a character set size of 4.

The TFieldDef is also used when a dataset is opened and the Field objects are created, or associated with fields added to a form by the IDE's Fields Editor. In IBX, instead of the standard field types for text fields, subclassed versions are used to provide the extended information. These are:

•TIBStringField
•TIBMemoField

Both have the additional properties:

•CharacterSetName
•CharacterSetSize

which, respectively, provide the Character Set Name and Size for the text string. The application may then process them accordingly.
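
As an illustration of how an application might read these properties, here is a hedged sketch. The unit name CharSetReport and the function DescribeCharSet are hypothetical, and the sketch assumes that TIBStringField and TIBMemoField are declared in the IBX unit IBCustomDataSet; adjust the uses clause if your IBX version places them elsewhere.

```pascal
unit CharSetReport;

{$mode objfpc}{$H+}

interface

uses
  DB, IBCustomDataSet;

// Hypothetical helper (not part of IBX): returns 'NAME (size)' for an
// IBX text field, or an empty string for any other field (or nil).
function DescribeCharSet(AField: TField): string;

implementation

uses
  SysUtils;

function DescribeCharSet(AField: TField): string;
begin
  Result := '';
  if AField is TIBStringField then
    Result := string(TIBStringField(AField).CharacterSetName) + ' (' +
              IntToStr(TIBStringField(AField).CharacterSetSize) + ')'
  else if AField is TIBMemoField then
    Result := string(TIBMemoField(AField).CharacterSetName) + ' (' +
              IntToStr(TIBMemoField(AField).CharacterSetSize) + ')';
end;

end.
```

A caller could loop over an open dataset's Fields collection and pass each field to this function, skipping the fields for which it returns an empty string.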

These field types also adjust the field size so that it returns the character width in the column multiplied by the character set size, while the DisplayWidth property should always return the character width of the column. This should ensure that the appropriate buffer size is allocated for the column whilst avoiding oversizing the DisplayWidth (and thus making automatic TDBGrid columns too wide).
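
The arithmetic behind the field size can be sketched as follows. FieldByteSize is a hypothetical helper written only to illustrate the rule stated above (character width multiplied by character set size); it is not the IBX implementation, which may differ in detail.

```pascal
program FieldSizeSketch;

{$mode objfpc}{$H+}

// Hypothetical helper mirroring the rule in the text:
// byte size = characters in the column * maximum bytes per character.
function FieldByteSize(CharWidth, CharSetSize: Integer): Integer;
begin
  Result := CharWidth * CharSetSize;
end;

begin
  // VARCHAR(20) CHARACTER SET UTF8: 20 characters, up to 4 bytes each.
  WriteLn(FieldByteSize(20, 4)); // prints 80
  // VARCHAR(20) in a single-byte character set such as ISO8859_1.
  WriteLn(FieldByteSize(20, 1)); // prints 20
end.
```

In both cases the DisplayWidth would remain 20, since it counts characters rather than bytes.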

The Memo Field types also have a new published boolean property:

•DisplayTextAsClassName

This defaults to false.

When false, the field's DisplayText property returns the text content of the field truncated to the DisplayWidth (number of characters to be displayed).

For IBX Memo Fields, the DisplayWidth defaults to 128 characters. If this is overridden and set to zero, then the DisplayText is not truncated.

On the other hand, if DisplayTextAsClassName is true, then the inherited behaviour is used and the DisplayText returns the class name in brackets. The DisplayWidth is the inherited default (currently 10).

FPC 3.0.0 onwards. TIBStringField and TIBMemoField also have a “CodePage” property. This gives the code page used for the string returned by the “AsString” property.

In Practice

New applications using IBX should just work “out of the box” as regards character set handling and, in most cases, you will have little need to understand the above unless you are having to process the character set type information for each column.

Existing applications should continue to work without a problem. However, if you have used the IDE's Fields Editor to create field objects as part of a form, and you want to:

•make use of the correct DisplayWidth, or
•make use of the DisplayTextAsClassName property, or
•process the character set type information for each column

then you will need to use the Fields Editor to remove the existing field object for each affected text field and then add it back again. This will create a field object using the new IBX subclassed field type. You may then make use of the extended properties.