
Tuesday, 21 October 2008

Erlang Mnesia - Take 2

After my first attempt at using Erlang Mnesia, I decided to apply what I had learned so far to refactor and improve my Erlang Mnesia implementation of the Address database and Data Access Layer (DAL).

Redesigning The Database

This time I de-normalised part of the database design:
  1. duplicated the hierarchy_order field from the address_field_type table into the address_field table
  2. removed the record id fields from all 3 entities:
    • address_field_type is a reference data table with a small amount of data, so I implicitly use locale_country + hierarchy_order as a combined key
    • address and address_field both have the location_code field, which is unique for each record, so I used location_code as their primary key
The address.hrl file:
-record(address_field_type, {
 locale_country,
 name,
 default_value,
 hierarchy_order,
 display_order,
 validation_rules,
 suffix} ).
 
-record(address_field, {
 location_code,
 value,
 locale_country,  %% denormalised from address_field_type
 hierarchy_order  %% denormalised from address_field_type
 }).
 
-record(address, {
 location_code,
 status
 }).
 
-record(address_addressField, {
 address_locationCode,
 address_field_locationCode
 }).
Creating the database:
create_db() ->
 mnesia:create_table(address_field_type,
  [
   {type, bag}, %% use bag because there is no single-field primary key now
   {ram_copies,[nonode@nohost]},
   {index, [hierarchy_order]},
   {attributes, record_info(fields, address_field_type)}
  ]),
 mnesia:create_table(address_field,
  [
   {ram_copies,[nonode@nohost]},
   {attributes, record_info(fields, address_field)}
  ]),
 mnesia:create_table(address,
  [
   {ram_copies,[nonode@nohost]},
   {attributes, record_info(fields, address)}
  ]),
 mnesia:create_table(address_addressField,
  [
   {type, bag},
   {ram_copies,[nonode@nohost]},
   {index, [address_field_locationCode]},
   {attributes, record_info(fields, address_addressField)}
  ]).
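Since all the tables above are ram_copies, no disc schema is needed before create_db/0 runs; starting Mnesia is enough. A minimal shell sketch (assuming create_db/0 lives in the address_db module used throughout these posts):

%% {ram_copies,[nonode@nohost]} above assumes the shell was started
%% without -sname/-name, i.e. an undistributed node
mnesia:start(),  %% a RAM-only schema is created on the fly
address_db:create_db().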

The above changes do reduce the functionality of the Address application very slightly, but it's a worthy trade-off as it simplifies the design - I no longer need the oid table/record, nor all the functions and code revolving around the oids.

Also, by de-normalising, i.e. duplicating the hierarchy_order field into the address_field table, sorting the address_field list becomes easy and efficient (without having to look up the database in every iteration). This is demonstrated in the code below.

Simplified Sorting

Thanks to the duplication of the hierarchy_order field from address_field_type into address_field in the data/entity model, the sorting of the address_field list has been reduced from 3 functions previously to a single call to lists:keysort/2.
sort_afs(Afs) when is_list(Afs) ->
 %% sort by hierarchy_order (the 5th element of the record tuple)
 lists:keysort(#address_field.hierarchy_order, Afs).
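A quick usage sketch with invented values (only the fields that matter to the sort are filled in):

Afs = [#address_field{location_code=2, value="Mumbai",
                      locale_country="IN", hierarchy_order=2},
       #address_field{location_code=1, value="India",
                      locale_country="IN", hierarchy_order=1}],
%% keysort orders by hierarchy_order ascending, so "India" comes first
[#address_field{value="India"}, #address_field{value="Mumbai"}] = sort_afs(Afs).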

Simplified Data Insertion

By skipping any record id handling, the code for inserting records is considerably simplified. This is most evident in the insert_address/2 function (compared to the previous version):
insert_aft(Aft) when is_record(Aft, address_field_type) ->
 Fun = fun() ->
  mnesia:write(Aft)
 end,
 mnesia:transaction(Fun),
 Aft.
 
insert_address(A, Afs) when is_record(A, address) and is_list(Afs) ->
 {NewA, NewAfs} = address:generate_location_code(A, Afs),
 Fun = fun() ->  
  % create the new address record
  mnesia:write(NewA),
  
  % now insert/update into address_field 
  % and insert into address_addressField table
  lists:foreach( fun(Af) ->
    mnesia:write(Af),
    A_Af=#address_addressField{ 
     address_locationCode=NewA#address.location_code,
     address_field_locationCode=Af#address_field.location_code
    },
    mnesia:write(A_Af)
   end,
   NewAfs),
  {NewA, NewAfs}
 end,
 mnesia:transaction(Fun).
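Since mnesia:transaction/1 wraps whatever the fun returns, a caller unwraps the result like this (a sketch; A and Afs are an address record and its address_field list as above):

{atomic, {NewA, NewAfs}} = address_db:insert_address(A, Afs),
%% NewA#address.location_code now holds the generated primary key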

Simplified Data Queries

By using the dirty_read and dirty_index_read functions, unnecessary transactions are avoided.
find_afts(LocaleCountry) when is_list(LocaleCountry) ->
 mnesia:dirty_read({address_field_type, LocaleCountry}).

find_address(Id) when is_integer(Id) ->
 [A] = mnesia:dirty_read({address, Id}),
 A.
 
find_address_field(Id) when is_integer(Id) ->
 [Af] = mnesia:dirty_read({address_field, Id}),
 Af.
 
find_address_field_codes(A_code) ->
 % the read returns a list of address_addressField tuples; collect
 % each tuple's address_field_locationCode (3rd element) into a list
 [ Aaf#address_addressField.address_field_locationCode
  || Aaf <- mnesia:dirty_read({address_addressField, A_code}) ].
 
find_address_codes(Af_code) ->
 F = fun() ->
  mnesia:select(address_addressField, [{
   #address_addressField{ address_locationCode='$1',
    address_field_locationCode=Af_code, 
    _='_' },
    [], ['$1'] }] 
  )
 end,
 {atomic, Results}=mnesia:transaction(F),
 Results.
 
%% note: using qlc:q/1 requires -include_lib("stdlib/include/qlc.hrl")
find_address_codes2(Af_code) ->
 F = fun() ->
  Q = qlc:q([Aaf#address_addressField.address_locationCode
   || Aaf <- mnesia:table(address_addressField),
   Aaf#address_addressField.address_field_locationCode==Af_code]),
  qlc:e(Q)
 end,
 {atomic, Results}=mnesia:transaction(F),
 Results.
 
find_address_codes3(Af_code) ->
 Aafs = mnesia:dirty_index_read(address_addressField, Af_code, 
  #address_addressField.address_field_locationCode),
 % the second element of each tuple is the address_locationCode
 [ Aaf#address_addressField.address_locationCode || Aaf <- Aafs ].
In the above code, find_address_codes/1, find_address_codes2/1 and find_address_codes3/1 do the same thing but are implemented using different approaches. The dirty-read one is the simplest.

Conclusion

Just by thinking in Erlang, the resulting database design has been simplified - fewer tables and fields; the source lines of code (SLOC) have been reduced by 30%; and, most importantly, the programming logic is now simpler and easier to read/maintain.

However, this does not mean that Erlang and Mnesia are the best way to implement this kind of database application (where normalisation is important for the business logic).

Saturday, 18 October 2008

Erlang Mnesia

Carrying out my Erlang learning plan, I embarked on the exercise of porting my Address DAO package to Erlang Mnesia. In fact, it is more than a port of the data access functions; it is also a port of the Address database itself from an RDBMS into Mnesia.

Creating the Mnesia Schema

The Entity Relationship Diagram (ERD) of the database schema is shown below.

To create the tables in Mnesia, I defined the following records, each corresponding to a table.

-record(address_field_type, {id,
 name,
 locale_country,
 default_value,
 hierarchy_order,
 display_order,
 validation_rules,
 suffix} ).
 
-record(address_field, {id,
 value,
 address_field_type_id,
 location_code}).
 
-record(address, {id,
 status,
 location_code}).
 
-record(address_addressField, {
 address_id,
 address_field_id
 }).
-record(oids, {
 name,
 id
 }).

Note that there is an extra record, oids. This record/table is required because Mnesia does not support the auto-incrementing fields usually used for primary key generation, so this extra table stores the current key value for each table - similar to Oracle's sequences.

The creation of the tables is shown below. The oids table is also populated with seed values.

create_db() ->
 mnesia:create_table(address_field_type,
  [
   {ram_copies,[nonode@nohost]},
   {attributes, record_info(fields, address_field_type)}
  ]),
 mnesia:create_table(address_field,
  [
   {ram_copies,[nonode@nohost]},
   {index, [location_code]},
   {attributes, record_info(fields, address_field)}
  ]),
 mnesia:create_table(address,
  [
   {ram_copies,[nonode@nohost]},
   {index, [location_code]},
   {attributes, record_info(fields, address)}
  ]),
 mnesia:create_table(address_addressField,
  [
   {type, bag},
   {ram_copies,[nonode@nohost]},
   {index, [address_field_id]},
   {attributes, record_info(fields, address_addressField)}
  ]),
 mnesia:create_table(oids,
  [
   {attributes, record_info(fields, oids)}
  ]),
 Fun = fun() ->
  mnesia:write(#oids{ name=address, id=0 }),
  mnesia:write(#oids{ name=address_field_type, id=0 }),
  mnesia:write(#oids{ name=address_field, id=0})
 end,
 mnesia:transaction(Fun).

Simple CRUD

The address_field_type table stores reference data and does not rely on other tables, so the C and R of its CRUD are very simple. From the address_db.erl file:
%% returns {atomic, Oid} or {aborted, Reason}
generate_oid(TableName) when is_atom(TableName) ->
 Fun = fun() ->
  [Oid] = mnesia:read(oids, TableName, write),
  %% because Erlang only supports single assignment
  %% I have to create new variables every time the value changes.
  NewId=Oid#oids.id+1,
  New = Oid#oids{ id=NewId },
  mnesia:write(New),
  NewId
 end,
 mnesia:transaction(Fun).
  
insert_aft(Aft) when is_record(Aft, address_field_type) ->
 Fun = fun() ->
  {atomic, Id}=generate_oid(address_field_type),
  New = Aft#address_field_type{ id=Id },
  mnesia:write(New),
  New
 end,
 mnesia:transaction(Fun).

find_aft(Id) when is_integer(Id) ->
 {atomic, [Aft]} = mnesia:transaction(fun()-> mnesia:read({address_field_type, Id}) end),
 Aft.
There are a few things worth noting:
  1. Although the mnesia:read and mnesia:select functions can be read-only, they still need to be wrapped in a function and run via mnesia:transaction/1. This does not make sense to me.
  2. mnesia:transaction(Fun) returns {aborted, Reason} or {atomic, Result}, where Result is whatever Fun returns. Therefore, insert_aft/1 above returns New - the new address_field_type record with the id field populated.
  3. Erlang is a single-assignment language - i.e. a variable can only be assigned a value once. Therefore, if I want to set Aft#address_field_type.id, I have to create a whole new address_field_type record, as in insert_aft/1 above (see the sketch after this list). I have to say, I am not a fan of single assignment - it defeats half the purpose of having variables.
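To make point 3 concrete, a tiny sketch (values invented):

Aft = #address_field_type{name="city"},
%% Aft itself can never be mutated; the record-update syntax below
%% builds and returns a brand-new tuple instead
NewAft = Aft#address_field_type{id=42},
undefined = Aft#address_field_type.id,  %% the original is unchanged
42 = NewAft#address_field_type.id.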

A Slightly More Complicated CRUD

The address entity relies on address_field, which in turn depends on address_field_type. Also, if address records share the same address_field values, then the address_field record should be shared, not duplicated in the database. Therefore, the management of address records is more complicated than that of address_field_type.

The creation of address records is shown below.

insert_address(A, Afs) when is_record(A, address) and is_list(Afs) ->
 {A2, Afs2} = address:generate_location_code(A, Afs),
 Fun = fun() ->
  % populate the address_field records' id field
  NewAfs=lists:foldl( fun(Af, Acc) ->  %% Acc avoids shadowing NewAfs
    % check if the same Af is already in database
    Result = mnesia:select(address_field, [{
     #address_field{ id='$1',
      location_code=Af#address_field.location_code, 
      _='_' },
     [], ['$1'] }] 
    ),
    if length(Result)==0 -> % Af not in DB, so insert it
     {atomic, Id}=generate_oid(address_field);
    true -> % Af already exists in database
     Id=lists:nth(1, Result)
    end,
    NewAf = Af#address_field{ id=Id },
    lists:append(Acc, [NewAf])
   end,
   [],
   Afs2),
  
  % create the new address record
  {atomic, AddressId} = generate_oid(address),
  NewA = A2#address{ id = AddressId },
  mnesia:write(NewA),
  
  % now insert/update into address_field 
  % and insert into address_addressField table
  lists:foreach( fun(Af) ->
    mnesia:write(Af),
    A_Af=#address_addressField{ 
     address_id=AddressId,
     address_field_id=Af#address_field.id
    },
    mnesia:write(A_Af)
   end,
   NewAfs),
  {NewA, NewAfs}
 end,
 mnesia:transaction(Fun).
From the address.erl file:
%% applying quick sort to AddressField list.
sort_afs([]) -> [];
sort_afs([Pivot|Tail]) when is_record(Pivot, address_field) ->
 sort_afs([Af || Af <- Tail, 
  compare_aft(address_db:find_aft(Af#address_field.address_field_type_id),
   address_db:find_aft(Pivot#address_field.address_field_type_id)) < 0])
  ++ [Pivot] ++
  sort_afs([Af || Af <- Tail, 
  compare_aft(address_db:find_aft(Af#address_field.address_field_type_id),
   address_db:find_aft(Pivot#address_field.address_field_type_id)) >= 0]).

compare_aft(Aft1, Aft2) when 
 is_record(Aft1, address_field_type) and is_record(Aft2, address_field_type) ->
 Aft1#address_field_type.hierarchy_order - Aft2#address_field_type.hierarchy_order.

generate_location_code(A, AddressFields) when is_record(A, address) 
 and is_list(AddressFields) and (length(AddressFields)>0) ->
 Afs=generate_location_code(AddressFields),
 FirstAf=lists:nth(1, Afs),
 { 
  A#address{ location_code=FirstAf#address_field.location_code },
  Afs
 }.

generate_location_code(Afs) ->
 ReverseSortedAfs=lists:reverse(sort_afs(Afs)),
 generate_location_code(ReverseSortedAfs, [], []).
 
generate_location_code([Head|Tail], String, NewAfs) when is_record(Head, address_field) ->
 StringValue = Head#address_field.value ++ String,
 Code=erlang:phash2(StringValue),
 generate_location_code(Tail, 
  StringValue,
  [ Head#address_field{ location_code=Code } ] ++ NewAfs);  
generate_location_code([], _String, NewAfs) ->
 NewAfs.   
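To make the recursion concrete, here is a worked sketch with invented values, assuming a lower hierarchy_order means a coarser field, so the sorted list is India(1), Maharashtra(2), Mumbai(3):

%% the sorted list is reversed, so each field's code hashes its own
%% value concatenated with all the finer-grained values:
Code_Mumbai      = erlang:phash2("Mumbai"),
Code_Maharashtra = erlang:phash2("MaharashtraMumbai"),
Code_India       = erlang:phash2("IndiaMaharashtraMumbai").
%% generate_location_code/2 then gives the address record the first
%% field's (India's) code - effectively a hash of the complete address.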
The following functions are queries from the address_db.erl file:
find_address(Id) when is_integer(Id) ->
 {atomic, [A]} = mnesia:transaction(fun()-> mnesia:read({address, Id}) end),
 A.
 
find_address_field(Id) when is_integer(Id) ->
 {atomic, [Af]} = mnesia:transaction(fun()-> mnesia:read({address_field, Id}) end),
 Af.
 
find_address_field_ids(A_id) ->
 mnesia:transaction(fun()-> mnesia:read({address_addressField, A_id}) end).
 
find_address_ids(Af_id) ->
 F = fun() ->
  mnesia:select(address_addressField, [{
   #address_addressField{ address_id='$1',
    address_field_id=Af_id, 
    _='_' },
    [], ['$1'] }] 
  )
 end,
 mnesia:transaction(F).
 
%% note: using qlc:q/1 requires -include_lib("stdlib/include/qlc.hrl")
find_address_ids2(Af_id) ->
 F = fun() ->
  Q = qlc:q([Aaf#address_addressField.address_id 
   || Aaf <- mnesia:table(address_addressField),
   Aaf#address_addressField.address_field_id==Af_id]),
  qlc:e(Q)
 end,
 mnesia:transaction(F).
Things I want to highlight from the above:
  1. Unlike Java or C#, Erlang does not have a sortable list out of the box. Instead, I had to write functions that sort my list of address_field records - namely sort_afs and compare_aft.
  2. When searching by the primary key field, use mnesia:read; when searching by other fields, use either mnesia:select or QLC. In the above code, find_address_ids/1 and find_address_ids2/1 do the same thing; however, the result (a list) given by find_address_ids2/1 is reversed.
  3. If a record with the same primary key already exists in the table, mnesia:write either updates it (if the table type is set) or adds another record (if the table type is bag).
  4. Again, due to the single-assignment rule, lists:append(List, AnotherList) does not change List; instead, the new list is the return value of lists:append/2 (see the sketch after this list). This is just counter-intuitive.
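The sketch for point 4:

L1 = [1, 2],
L2 = lists:append(L1, [3]),
[1, 2]    = L1,   %% L1 is untouched
[1, 2, 3] = L2.   %% the appended list is the return value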


Monday, 13 October 2008

Erlang ODBC

Continuing my Erlang journey, I decided to try Erlang ODBC by porting one of my toy applications. Perhaps not a wise choice, since working with databases (especially relational databases) is not a strength of Erlang, at least for now.

After almost 10 years, and having been spoiled by a myriad of Object-Relational Mapping (ORM) frameworks, using ODBC again feels like a giant leap backward - 3 generations back, to be exact (ADO.NET, ADO, ODBC). So the overall development experience is not that great. This is not to say that Erlang is not a great language, but Erlang as a platform is quite narrow in scope (which makes sense considering Erlang's strong telco heritage).

So here is my porting exercise. Note that the code below is my feeble attempt at using/learning Erlang by reading the man pages alone, so if you have suggestions to improve it, please drop a comment. Also, it is not robust, since it does not handle error conditions.

Background

The application (or component) that I am working on is a Data Access Objects (DAO) package which I originally developed in Java using various ORM frameworks, and then in C# using various 3rd-party ORM and ADO.NET frameworks. The data model is quite simple, consisting of 3 entities: AddressFieldType (AFT for short), AddressField (AF for short) and Address. The relationships among them are as follows:

  • AFT - AF has a one-to-many relationship
  • Address - AF has a many-to-many relationship

The partial entity relationship diagram (ERD) can be found in my previous blog.

Here, I attempt to implement some of the interfaces from the DAO package using Erlang ODBC.

The Design

Since Erlang is not an object-oriented language, I cannot call my module DAO, so I call it DAL (Data Access Layer) instead, borrowing from Microsoft terminology. I have two simple modules:

  1. address_dal: implementation of some of the interfaces retrieving records from the Address database (using MySQL 5.x with ODBC driver 5.1.5 for win32, downloaded from MySQL).
  2. orm: ORM here means ODBC-Record Mapping. There are two types of functions in this module: those that provide ODBC template functions, and those that convert ODBC-returned data into the Erlang records defined in the address.hrl file.

Records

I want to work with Erlang records rather than the tuples returned by ODBC calls, so I created these records to represent the domain model. From the address.hrl file:

-record(address_field_type, {id,
 name,
 locale_country,
 default_value,
 hierarchy_order,
 display_order,
 validation_rules,
 suffix} ).
 
-record(address_field, {id,
 value,
 address_field_type_id,
 location_code}).
 
-record(address, {id,
 status,
 location_code,
 address_fields = [#address_field{}]}).

Notice the address record has a list of address_field records, representing half of the many-to-many relationship.

ODBC Template

I need to surround my SQL queries with database connection and disconnection so that this boilerplate code does not get scattered throughout my business logic. Also, if a SQL query results in an error, Erlang breaks the ODBC connection (I am not sure whether this is Erlang's fault or ODBC's), so I have to reconnect before my next SQL query anyway.

So in my orm.erl I have these ODBC template/boiler-plate functions:

-define(CONNECTION_STRING, "DSN=Address-MySQL;UID=root").

connect() ->
 case proplists:is_defined(odbc,application:which_applications()) of
  false ->
   application:start(odbc); % pre R12 way of starting odbc;
  _Else ->
   false % do nothing
 end,
 odbc:connect(?CONNECTION_STRING, [
  {auto_commit, off}, 
  {scrollable_cursor, off} 
 ]).
 
sql_query(Query) ->
 {ok, Ref}=connect(),
 ResultTuple = odbc:sql_query(Ref, Query),
 odbc:disconnect(Ref),
 ResultTuple.
 
param_query(Query, Params) ->
 {ok, Ref}=connect(),
 ResultTuple = odbc:param_query(Ref, Query, Params),
 odbc:disconnect(Ref),
 ResultTuple.
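A usage sketch of the template functions (the table is the one queried in the DAL further below; for a SELECT statement, odbc returns {selected, ColNames, Rows}):

%% ColNames is a list of column-name strings and
%% Rows is a list of tuples, one per returned row
{selected, ColNames, Rows} =
    orm:sql_query("SELECT * FROM Address.AddressFieldType").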

Note that Erlang ODBC is based on ODBC v3, which supports connection pooling, and pooling is turned on. So surrounding each SQL query with a connect and disconnect should not incur a performance penalty (although I have not explicitly tested this).

One thing I hate about Erlang's odbc:connect/2 is that it does not give any meaningful error reason when the connection fails - it always says "No SQL-driver information available. Connection to database failed." Again, I am not sure whether this is Erlang's or ODBC's fault.

Converting into Records

The ODBC query functions return their results as tuples or lists of tuples. I want to convert them into records so that I can access the fields more easily - e.g. #address.id.

I tried two approaches to implementing the record constructor functions. The first is more portable since it does not rely on field positions, but instead uses the column names to map to the record's fields. This is shown below, from the orm.erl file, mapping AddressFieldType query results to their corresponding address_field_type records:

construct_aft_records(ColNames, Rows) when erlang:is_list(ColNames) and (length(ColNames)>0)
 and erlang:is_list(Rows) ->
 if (length(Rows)>0) ->
  lists:foldl(fun(R, AftRecords) ->
    lists:append(AftRecords, [construct_aft_record(ColNames, R)])
   end,
   [], % initial value of AftRecords list
   Rows
  );
 true->
  []
 end.

   
construct_aft_record(ColNames, Row) when erlang:is_list(ColNames) and (length(ColNames)>0)
 and erlang:is_tuple(Row) and (size(Row) == length(ColNames)) ->
 %% build a list of {UPPERCASED_COLUMN_NAME, Value} pairs from the row
 Map=lists:foldl(fun(N, Acc) ->
  lists:append(Acc,
   [{string:to_upper(lists:nth(N, ColNames)),
   element(N,Row)}]
  )
  end,
  [], % initial value of the accumulator is []
  lists:seq(1, length(ColNames))
 ),
 #address_field_type{
  id=extract_map_value(Map, "ADDRESSFIELDTYPEID"),
  name=extract_map_value(Map, "NAME"),
  locale_country=extract_map_value(Map, "LOCALECOUNTRY"),
  default_value=extract_map_value(Map, "DEFAULTVALUE"),
  hierarchy_order=extract_map_value(Map, "HIERARCHYORDER"),
  display_order=extract_map_value(Map, "DISPLAYORDER"),
  validation_rules=extract_map_value(Map, "VALIDATIONRULES"),
  suffix=extract_map_value(Map, "SUFFIX")
 }.

 
extract_map_value(Map, Key) ->
  %% lists:keysearch/3 returns {value, {Key, Value}}; extract Value
  element(2, element(2, lists:keysearch(Key, 1, Map))).
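A usage sketch with invented column data (per the guards above, the column list's length must equal the row tuple's size):

ColNames = ["addressFieldTypeId", "name", "localeCountry", "defaultValue",
            "hierarchyOrder", "displayOrder", "validationRules", "suffix"],
Row = {1, "city", "AU", "", 3, 3, "", ""},
Aft = orm:construct_aft_record(ColNames, Row),
"city" = Aft#address_field_type.name.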

The second approach is lazier, as it relies on the field positions specified in the SELECT SQL statement. This is shown below, from the orm.erl file.

find_addresses(Query, Params) ->
 {ok, Ref}=connect(),
 {selected, _, Rows} = odbc:param_query(Ref, Query, Params),
 % for each address row, fetch its addressField records
 AddressRecords = lists:foldl(fun(A, Records) ->
   AddressId=element(1, A),
   AddressStatus=element(2, A),
   AddressLocationCode=element(3, A),
   {selected, _, AfRows}=odbc:param_query(Ref,
    "SELECT af.addressFieldId, af.value, af.addressFieldTypeId,"
    "       af.locationCode, aft.hierarchyOrder"
    " FROM Address.addressField as af, Address.AddressFieldType as aft,"
    "      Address.address_addressField as aaf"
    " WHERE aaf.addressId=?"
    "   AND af.addressFieldId=aaf.addressFieldId"
    "   AND aft.addressFieldTypeId=af.addressFieldTypeId"
    " ORDER BY aft.hierarchyOrder",
    [{sql_integer, [AddressId]}]
   ),
   AfRecords = lists:foldl(fun(Af, AddressFieldRecords) ->
     AfRecord=#address_field{
      id=element(1, Af),
      value=element(2, Af),
      address_field_type_id=element(3, Af),
      location_code=element(4, Af)
     },
     lists:append(AddressFieldRecords, [AfRecord])
    end,
    [],
    AfRows
   ),
   AddressRecord=#address{
    id=AddressId,
    status=AddressStatus,
    location_code=AddressLocationCode,
    address_fields=AfRecords
   },
   lists:append(Records, [AddressRecord])
  end,
  [],
  Rows
 ),
 odbc:disconnect(Ref),
 AddressRecords.

Implementing the DAL Interfaces

Now that the boiler-plate code is done, I can implement the DAL interface methods - oops! I mean functions. Here is the address_dal.erl file.

%% Implementation of the Address Data Access Layer (DAL) interfaces.
-module(address_dal).
-export([find_aft/0, find_aft/1]).
-export([find_addresses_in_af/1, find_addresses_in_location/1]).
-include("address.hrl").
 
find_aft() ->
 {selected, ColNames, Rows} = orm:sql_query("SELECT * from Address.AddressFieldType"),
 orm:construct_aft_records(ColNames, Rows).
 
find_aft(LocaleCountry) ->
 {selected, ColNames, Rows} = orm:param_query("SELECT * from Address.AddressFieldType "
  "WHERE localeCountry=?", 
  [{{sql_varchar, 64}, [LocaleCountry]} ]
 ),
 orm:construct_aft_records(ColNames, Rows).


find_addresses_in_af(AddressFieldId) ->
 orm:find_addresses( 
  "SELECT a.addressID, a.status, a.locationCode"
  " FROM Address.Address as a, Address.Address_AddressField as aaf"
  " WHERE aaf.ADDRESSFIELDID = ?"
  " AND a.ADDRESSID=aaf.ADDRESSID",
  [{sql_integer, [AddressFieldId]}] 
 ).
 
% the input parameter LocationCode needs to be a string due to the ODBC 
% to Erlang datatype mapping - hence the test for is_list().
find_addresses_in_location(LocationCode) when is_list(LocationCode) ->
 orm:find_addresses("SELECT a.addressID, a.status, a.locationCode"
  " FROM Address.Address as a, Address.AddressField as af, Address.Address_AddressField as aaf"
  " WHERE af.LOCATIONCODE = ?"
  "   AND a.ADDRESSID=aaf.ADDRESSID"
         "   AND af.ADDRESSFIELDID=aaf.ADDRESSFIELDID",
  [{{sql_numeric, 32, 0}, [LocationCode]}] 
 ).
Notice that in the find_aft functions I can use 'SELECT * ...' because the orm:construct_aft_records/2 function uses the column names to map to the address_field_type record; the find_addresses... functions, on the other hand, have to use 'SELECT a.addressID, a.status, a.locationCode...' in that exact order, because the orm:find_addresses/2 function expects the columns to be in those positions.

Wednesday, 3 September 2008

The Sorry State of ADO.NET Entity Framework 1.0

A few months ago I test-drove ADO.NET Entity Framework 1.0 Beta 3. While I enjoyed certain aspects of EF, such as LINQ and graphical modelling, the overall experience of using EF hasn't been that great. In fact, my very first blog post on this site was dedicated to EF bashing. My major complaint about EF was its intrusive approach to ORM - instead of using PONOs (as Hibernate does with POJOs), it generates bloated, relational-database-centric data entity models. This goes against the domain-driven design method. Another problem I encountered was its shoddy implementation of lazy loading (or rather the lack of it).

I then realised that I was not alone in feeling disenchanted with EF. In the same month as my post, several hundred people signed the ADO.NET Entity Framework Vote of No-Confidence open letter, many of them Microsoft Most Valuable Professionals (MVPs). In last month's issue of Visual Studio Magazine, the article A Vote for Transparency gave a blow-by-blow account of the controversy surrounding EF, including the aforementioned open letter. So Microsoft decided to make the EF version 2 design exercise a more transparent process. This is no doubt a step forward; however, it is too little, too late for EF v1. It looks like EF v1 will not receive any significant improvements or address any of the community's concerns raised in the open letter. In fact, I doubt that any EF design improvements made it into .NET Framework 3.5 SP1 (released in August 2008) since EF 1.0 Beta 3 (released in December 2007).

On a side note: although some people whine about the JCP being slow (e.g. the Java closures debate) and sometimes committee-driven, compared to a traditional closed-door design process like EF v1's, the JCP is a great leap forward, and the results speak for themselves. JCP specifications are more readily accepted and adopted by the community than proprietary ones like EF.

Saturday, 16 August 2008

ORM Frameworks and DBA Friendliness

Database programming used to rely directly on tools provided by the database vendors, such as ESQL-C from Informix, Pro*C from Oracle, and similar precompilers from Sybase. Using these tools entangles the business logic (C/C++ code) with the database design and logic (SQL code).

The emergence of Object-Relational Mapping (ORM) frameworks solved this problem by decoupling the code from the database assets. Nowadays, when designing software, the question is not whether to use ORM, but which ORM framework to use.

Most ORM or entity frameworks out there (Hibernate, JPA, Entity Framework) generate SQL statements automatically and hide the internals from the developers - and rightly so. However, this creates another problem: optimising SQL statements is usually a joint effort between the developers and the DBAs (or data architects), and the DBA should have the final say in how to shape the SQL queries and how to design/modify the database schema to suit the usage patterns of the application(s). Now that the SQL statements are auto-generated by the framework, and the DBA cannot design or examine the SQL code up front, the DBA's control is diminished.

This loss of control may not be a problem for custom-built, one-off applications, or applications with a simple data model. However, if you are dealing with a large, complex product that has to be deployed in different environments, then it is crucial to have full control over the database design and deployment, and over how the database is to be used.

To have visibility of the SQL statements up front, there seem to be two options:

  1. revert to JDBC/ADO.NET... mixing the SQL statements with the business logic code - not advisable
  2. use a framework that exposes the SQL statements - e.g. iBATIS SQL Map

Strictly speaking, iBATIS is not an ORM framework - it maps objects to SQL statement inputs/outputs. There are two distinct advantages to using SQL Map:

  1. The database-specific code (SQL statements, connection strings, etc.) can be centralised and totally separated from the core (Java/C#...) code. This makes the code decoupled from the database and more portable across different databases;
  2. All database CRUD operations are specified in native SQL, which is DBA-friendly, making it easy for the DBA to review and participate in the design of these statements.

Because iBATIS does not map objects to tables, it is ideal for applications that have to use legacy databases which do not adhere to object-oriented design, or whose structure differs totally from the object model.

Therefore, if you are involved in software product development (as opposed to custom-built, one-time projects), DBA friendliness should be a factor when choosing an ORM framework.

Friday, 27 June 2008

Dynamic Query

I have a prototype application which searches the database for Address records based on keywords. The keywords are entered in the GUI by the user, and their number is unknown beforehand, so a dynamic query is used. The corresponding entity model is shown in my other blog post. The equivalent SQL statement looks something like this:

select a.*
from Address as a, AddressField as af, Address_AddressField as aaf
where a.ADDRESSID=aaf.ADDRESSID
and aaf.ADDRESSFIELDID=af.ADDRESSFIELDID
and (upper(af.VALUE) like 'BOMBAY'
or upper(af.VALUE) like 'MUMBAI'
or ... /* more parameters */ )

(Note that in a production system, it is better to have an additional column in the database table to store the upper-cased values. That way, they can be indexed to improve query performance.)

Here, I will show how various ORM frameworks handle this type of dynamic query.

iBATIS SQL Map

iBATIS SQL Map is a mapping framework between SQL statement inputs/outputs and the object model, so SQL statements are used in SQL Map directly. The SQL Map query is quite straightforward and uses the same SQL statement:

 select a.*
 from Address as a, AddressField as af, Address_AddressField as aaf
 where a.ADDRESSID=aaf.ADDRESSID
 and aaf.ADDRESSFIELDID=af.ADDRESSFIELDID
 <iterate prepend="and" open="(" close=")" conjunction="or">
  upper(VALUE) like #[]#
 </iterate>
iBATIS provides the <iterate> tag to denote that every element in the input parameter (which is a collection, in this case a list) is formatted into the SQL statement using the tag's various attributes - prepend, open, close, conjunction, etc.

The corresponding Java code calling this query:

public class SQLMapAddressDao extends SqlMapDaoTemplate
implements IAddressDao {
...
 private Address[] getByFieldValues(List<AddressField> afs)
 throws DataAccessException {
  try {
   List list=queryForList("Address.getByFieldValues", afs);
   if(list==null || list.size()==0)
    return null;
   else
    return (Address[]) list.toArray(new Address[list.size()]);
  } catch (Exception e) {
   throw new DataAccessException(e);
  }
 }
...
}

iBATIS also supports .NET, so I can execute the same query there. The following C# code does the same thing as the Java snippet above.

public Address[] GetByFieldValues(IList afs) {
    IList list = ExecuteQueryForList("Address.getByFieldValues", afs);
    if (list == null || list.Count == 0)
        return null;
    else {
        return (Address[])
            ((ArrayList)list).ToArray(typeof(Address));
    }
}

Hibernate

Every ORM framework has its own object query language (OQL); Hibernate uses HQL. Programmatically, building the HQL statement is simply building a string:

public Address[] getByFieldValues(AddressField[] afs)
 throws DataAccessException{
  Address[] result=null;
   if(afs==null || afs.length==0)
   return null;
  // this query does not work with mySQL 4.0.x.
  String select="select addr"
   + " from com.laws.Address.Address as addr,"
   + "      com.laws.Address.AddressField as af";
  StringBuffer where=new StringBuffer(" where (upper(af.value) like :value_0");
  for(int i=1; i<afs.length; i++) {
   where.append(" or upper(af.value) like :value_"+i);
  }
  where.append(") and addr in elements(af.addresses)");

  String query=select+where.toString();

  try {
   Session sess = HibernateSessionFactory.currentSession();
   Query q = sess.createQuery(query);
   for(int i=0; i<afs.length; i++) {
    q.setParameter("value_"+i, %"+afs[i].getValue().toUpperCase()+"%");
   }
   List list = q.list();
   if (list != null && list.size() > 0) {
    result = (Address[]) list.toArray(new Address[list.size()]);
   }

   HibernateSessionFactory.closeSession();
  } catch (HibernateException he) {
   HibernateSessionFactory.handleException(he);
  }
  return result;
 }

JPA

JPA is included in the EJB 3.0 specification to replace the old EJB Container-Managed Persistence (CMP) approach. In JPA, the same query is programmatically built in JPQL:

public Address[] getByFieldValues(AddressField[] afs)
   throws DataAccessException {
  if(afs==null || afs.length==0)
   return null;

  String select="select a"
   + " from Address addr,"
   + "      AddressField af, in(af.addresses) a";
  StringBuffer where=new StringBuffer(" where (upper(af.value) like :value_0");
  for(int i=1; i<afs.length; i++) {
   where.append(" or upper(af.value) like :value_"+i);
  }
  where.append(") ");
  String queryString=select+where.toString();

  Query query=EntityManagerHelper.createQuery(queryString);
  // bind the named parameters (:value_0, :value_1, ...) built above;
  // upper-case and wrap in '%' wildcards to match the LIKE comparison
  for(int i=0; i<afs.length; i++){
   query.setParameter("value_"+i, "%"+afs[i].getValue().toUpperCase()+"%");
  }
  List list=query.getResultList();
  if(list==null || list.size()==0)
   return null;
  else
   return (Address[]) list.toArray(new Address[list.size()]);
 }

The code is almost identical to the Hibernate version except for the JPQL vs. HQL syntax differences. This is not surprising, considering JPA shares its roots with Hibernate.

ADO.NET Entity Framework

EF 1.0 Beta 3 does not support dynamic queries (using LINQ to SQL). To have string-based dynamic queries, I had to download a sample from MSDN, extract the Dynamics.cs file from the zip and place it in my project. That makes the System.Linq.Dynamic namespace available to me.

public Address.Domain.Address[] GetByFieldValues(Address.Domain.AddressField[] afs) {
    if (afs == null || afs.Length == 0)
        return null;
    // start with an empty query and union in one sub-query per keyword
    var aQuery = addressContext.AddressField.Where("1=0").Select(a => a);
    foreach (Address.Domain.AddressField af in afs) {
        if (af.GetValue().Length > 0) {
            var aQuery2 = addressContext.AddressField
                .Where("value.ToUpper().Contains(@0)", af.GetValue().ToUpper())
                .Select(a => a);
            aQuery = aQuery.Union(aQuery2);
        }
    }

    // collect the distinct Addresses reachable from the selected AddressFields
    Dictionary<EntityKey, AddressModel.Address> dict = new Dictionary<EntityKey, AddressModel.Address>();
    foreach (AddressModel.AddressField afEntity in aQuery) {
        afEntity.Address.Load();
        afEntity.AddressFieldTypeReference.Load();
        foreach (AddressModel.Address aEntity in afEntity.Address) {
            if (!dict.ContainsKey(aEntity.EntityKey))
                dict.Add(aEntity.EntityKey, aEntity);
        }
    }
    return AddressEFHelper.AddressArrayEntityToDomain(dict.Values.ToArray());
}
This implementation is very clumsy compared to the other frameworks above:
  1. I could not build a single query string. I think it's due to the escaped double quotes (\") in the string - somehow, it gives me an error at runtime.
  2. I could not get the Address entities from the first iteration (which gives me AddressFields); I had to get the Addresses from the selected AddressFields in a second iteration (nested foreach loops).
Personally, I like iBATIS best because it gives me the flexibility of writing my own SQL statements. The SQL statements can be reviewed by the DBA and provide input for database optimisation (as we all know, the database is usually the weakest link in overall system performance).

Tuesday, 17 June 2008

Data Logic and Business Logic

The 3-tier or n-tier design has been widely adopted for very good reasons: separation of concerns; divide and conquer...

One doctrine of the 3-tier design is to treat the data tier as data storage and not to implement any business logic inside the database. However, this does not mean that the database should not contain any logic. On the contrary, any data-specific logic should be implemented in the database, not in the business tier.

I recently read an interesting post on using Hibernate Interceptor classes to access tables in the desired schema. The end result of this implementation is to hide the fact that the application is dealing with tables in multiple database schemas, so that SQL queries don't have to be littered with 'schemaName.tableName' everywhere. Of course, a simpler solution to this problem is to create a public synonym in the database.

In my opinion, this 'table routing' logic should be implemented in the database server as it is database design/deployment specific. On the other hand, Hibernate is an ORM framework and logically belongs to the data tier in the 3-tier architecture, so it is not wrong to do it in there, especially if the database server does not support synonyms.

Another real-life example of data logic concerns high-volume tables - e.g. in telco billing applications there are millions of call event records a day, so it is common practice to partition the call event table. In the earlier days, when database servers did not support table partitioning (e.g. SQL Server 2000), DBAs manually segmented tables (e.g. by creating multiple tables: CallEvent_20080201, CallEvent_20080202, etc.). Obviously the application should not implement the logic of deciding which table to use when executing a query (e.g. 'get me all call records for today'); it should let the database handle it - a common practice is to create a database view to consolidate/union the manually partitioned tables. An advantage of implementing this in the database is that when you upgrade or migrate to a database server which supports table partitioning (e.g. SQL Server 2005+, Oracle 8+), your code need not change (OK, if you are using ORM, the mapping file may need some modification).

To make better design decisions, the system designer/architect and the data designer/architect must design the system collaboratively.

Tuesday, 10 June 2008

Entity Framework Bug

In a previous post I blogged about the flaws of Entity Framework (EF, v1.0 Beta 3), one of which was the inconsistency between the entity context and the actual data in the database tables. Here I demonstrate it.

I have two tables: AddressField and Address with a many-to-many relationship as illustrated below.

In the database, the many-to-many relationship is represented by the address_addressfield table. EF correctly generated the entity model, which is shown below. In the entity model, Address has a collection of AddressFields and vice versa. To avoid circular loops, lazy loading is used. So far, so good.

Suppose I have 3 addresses a1, a2 and a3, all of which are located in the same country, i.e. all 3 contain the same AddressField. Conversely, the AddressField representing the country should have 3 Addresses in its collection.

If I delete any Address, the corresponding database records in the Address and Address_AddressField tables should be removed. I would expect the same thing to happen in the entity context - i.e. the deleted address should be removed from the AddressField's Address collection.

The following code snippet from my unit test illustrates this:

addressDao.Delete(a1);
Assert.AreEqual(2, addressDao.GetAll().Length);
addressDao.Delete(a2);
addressDao.Delete(a3);
Assert.IsNull(addressDao.GetAll());
Assert.AreEqual(0, 
  afDao.CountAddressReferences(afCountry)); // test fails here
The CountAddressReferences() method implementation is shown below:
public int CountAddressReferences(Address.Domain.AddressField af) {
    int id=(int) af.GetAddressFieldId();
    AddressModel.AddressField entity = addressContext.AddressField
        .Where(t => t.addressFieldId == id).First();
    entity.Address.Load();
    return entity.Address.Count;
}
When I inspected the database tables, everything was OK - the Address table was empty and so was Address_AddressField. So the problem is in the entity context - it is not refreshed/updated when the association between Address and AddressField is followed. To get around this problem, I had to detach and re-attach the AddressField, as shown below (lines 7 to 9, marked 'extra line'):
public int CountAddressReferences(Address.Domain.AddressField af) {
    int id=(int) af.GetAddressFieldId();
    AddressModel.AddressField entity = addressContext.AddressField
        .Where(t => t.addressFieldId == id).First();
    entity.Address.Load();
    Console.WriteLine("before count=" + entity.Address.Count);
    addressContext.Detach(entity);  // extra line
    addressContext.Attach(entity);  // extra line
    entity.Address.Load();          // extra line
    Console.WriteLine("after count=" + entity.Address.Count);
    return entity.Address.Count;
}
Inspecting the console output yielded:
before count=3
after count=0

Sunday, 8 June 2008

Entity Framework 1.0 Beta 3

The much anticipated Microsoft ADO.NET Entity Framework (by me at least) has turned out to be a big disappointment.

Entity Framework (EF) is an improvement over the old ADO.NET and is supposed to be Microsoft's answer to third-party Object-Relational Mapping (ORM) products such as Hibernate. However, it totally misses the point of why people want ORM and how these tools are used.

The purpose of ORM is to shield the database details from the business logic so that the developer does not have to worry about implementation details such as how the data is stored on disk or how foreign-key relationships are traversed. Developers only need to deal with the domain object model, and the domain objects should be modelled as Plain Old .NET Objects (PONO, borrowing from the POJO acronym). This is evident in the tried and true ORM frameworks that originated in the Java world: Hibernate, JPA, etc. (and, to a certain extent, the iBATIS SQL Mapper).

The EF way of doing this is to have an Entity Model which can be generated from the database schema. The Entity Model classes inherit from System.Data.Objects.DataClasses.EntityObject, which means these entity objects are not PONOs. So if you want a loosely coupled system (between the data tier and the business tier), you will need a mapping layer between your PONOs and these entity objects - unnecessary additional work which should have been taken care of by the ORM framework itself.

Another problem with dealing directly with these entity objects is that the developer has to be aware of a whole new caching layer, the entity context, which caches the database records in memory in the form of these entity objects. Very often, the developer has to worry about synchronisation between the entity context and the actual database (by calling the Attach(), Detach() and Load() methods explicitly on the entity model objects), especially when many-to-many associations need to be followed on Delete or Update operations. See my other post for an example of this.

Microsoft has also been evangelising the practice of putting these entity objects directly into GUI widgets as their data source. This again creates tight coupling between the data tier and the presentation tier. All the extra data and methods concerning the entity object life-cycle should be visible only in the data tier (or DAL, as Microsoft calls it) and nowhere else. That is why PONOs should be used throughout the business and presentation tiers, rather than these bloated entity objects. It is disappointing that, after witnessing all these great ORM examples in the Java world, Microsoft still could not make a decent ORM framework.