Expert systems

Optimisation with datasets

This article describes, at a very technical level, how the server handles datasets. Datasets can hold a lot of data, which can make working with them slow. By understanding what datasets are and how the server handles them, an author can make a model much lighter on memory and CPU.

Datasets

A dataset is just a variable with a specific XML structure. It has the same structure as a graph, with iterations, nodes and data, because a dataset is defined by a graph. For example, suppose there is a graph called gstructure with a node data, and that node has two variables, firstname and lastname. You can create a variable mydataset of the type gstructure and add data to it with mydataset[0].data.firstname := 'joe'.

The XML in the variable mydataset will then be like this:

<dataset name="">
  <graph iteration="0" name="gstructure" tuplestatus="new" visibility="visible">
    <node name="data">
      <data name="firstname">joe</data>
      <data name="lastname"></data>
    </node>
  </graph>
</dataset>

The graph element has four attributes: iteration, name (the graph the structure is based on), tuplestatus and visibility. Tuplestatus indicates whether an iteration has been added to the dataset; this information can be used to update a database. Visibility is either visible or invisible after a select. This means that if you run a select on a big dataset, the filtered-out data is still there. As an author you do not have to think about that, because the invisible iterations are hidden.
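To make the attributes concrete, here is a minimal Python sketch that reads the example XML above with the standard library. It is only an illustration of the document format: the function name describe_iterations and the flat values dictionary are assumptions made for this example, not part of the server.

```python
import xml.etree.ElementTree as ET

# The example dataset from this article, as plain XML text.
DATASET_XML = """<dataset name="">
  <graph iteration="0" name="gstructure" tuplestatus="new" visibility="visible">
    <node name="data">
      <data name="firstname">joe</data>
      <data name="lastname"></data>
    </node>
  </graph>
</dataset>"""


def describe_iterations(xml_text):
    """Return one dict per graph element (hypothetical helper, for illustration)."""
    root = ET.fromstring(xml_text)
    rows = []
    for graph in root.findall("graph"):
        # Flatten all data elements of this iteration into name -> text.
        # This ignores the node level, which is enough for a single node.
        values = {d.get("name"): (d.text or "") for d in graph.iter("data")}
        rows.append({
            "iteration": int(graph.get("iteration")),
            "graph": graph.get("name"),
            "tuplestatus": graph.get("tuplestatus"),
            "visibility": graph.get("visibility"),
            "values": values,
        })
    return rows


rows = describe_iterations(DATASET_XML)
# Only visible iterations are what an author would normally see.
visible = [r for r in rows if r["visibility"] == "visible"]
print(visible[0]["values"]["firstname"])
```

Invisible iterations would still appear in rows; they are only filtered out in the last step, which mirrors the idea that a select hides data rather than removing it.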

So a dataset is just XML text. To work with a dataset, for instance with myname := mydataset[0].data.firstname, the server has to understand that XML. Internally, the server translates the XML into a structure. A dataset variable (e.g. mydataset) can therefore hold both the XML and the structure. Translating the XML to the structure and the structure back to XML is CPU intensive, so we should avoid it where possible. When the data is accessed, the variable knows whether the XML or the structure should be used.

Transforming data

When a dataset is created, e.g. mydataset := graphtodataset(gstructure), graphtodataset generates the XML, and the := directive takes that XML and puts it in mydataset. At this point mydataset holds only the XML text and no structure, because there is no need for one yet. If you now say mysecondset := mydataset, the XML text of mydataset is taken and put in mysecondset. Neither variable has a structure.

But now we want to know how many iterations there are in mydataset, so we say iterations := count(^mydataset). The server needs to understand the XML, so the XML is converted to the structure. The structure is now equal to the XML. The count looks at the structure and can easily return the number. If you say mysecondset := mydataset at this point, the XML is simply returned, because the structure and the XML are still the same.

After the count, I want to add a lastname to joe, using mydataset[0].data.lastname := 'doe'. The := assigns 'doe' to the variable. The variable already has the structure, so the value is set on lastname in the structure. The structure is now more recent than the XML and takes precedence over it. If I now say mysecondset := mydataset, the structure is first converted to XML and then returned to the :=. So the assignment to mysecondset before the count was much faster than it is now. After the conversion, the XML and the structure are equal again.
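The behaviour described in this section can be modelled with a small Python sketch. This is not the server's implementation: the class DatasetVariable, its methods and the conversions counter are all hypothetical names invented here, and the sketch only assumes what the text states, namely that the structure is built on first use and written back to XML only when it has changed.

```python
import xml.etree.ElementTree as ET

DATASET_XML = (
    '<dataset name="">'
    '<graph iteration="0" name="gstructure" tuplestatus="new" visibility="visible">'
    '<node name="data">'
    '<data name="firstname">joe</data>'
    '<data name="lastname"></data>'
    '</node></graph></dataset>'
)


class DatasetVariable:
    """Toy model: a variable holding XML text and, on demand, a parsed structure."""

    def __init__(self, xml_text):
        self._xml = xml_text       # text form, set by assignment
        self._tree = None          # parsed structure, built only when needed
        self._xml_stale = False    # True when the structure is newer than the XML
        self.conversions = 0       # counts XML <-> structure conversions

    def _structure(self):
        if self._tree is None:
            self._tree = ET.fromstring(self._xml)   # XML -> structure
            self.conversions += 1
        return self._tree

    def count(self):
        return len(self._structure().findall("graph"))

    def set_value(self, iteration, node, name, value):
        graph = self._structure().findall("graph")[iteration]
        graph.find(f"node[@name='{node}']/data[@name='{name}']").text = value
        self._xml_stale = True

    def to_xml(self):
        if self._xml_stale:
            # structure -> XML, only when the structure has changed
            self._xml = ET.tostring(self._structure(), encoding="unicode")
            self.conversions += 1
            self._xml_stale = False
        return self._xml


mydataset = DatasetVariable(DATASET_XML)
mysecondset = DatasetVariable(mydataset.to_xml())   # text copy, nothing parsed
iterations = mydataset.count()                      # XML -> structure
mysecondset = DatasetVariable(mydataset.to_xml())   # still cheap, XML is up to date
mydataset.set_value(0, "data", "lastname", "doe")   # structure now newer than XML
mysecondset = DatasetVariable(mydataset.to_xml())   # structure -> XML
print(mydataset.conversions)
```

In this model the first two copies cost no conversion at all, while the copy after the assignment to lastname forces a structure-to-XML conversion, which is the slowdown the text describes.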

Optimising the use of datasets

Take a look at the following code:

1  i := 0;
2  repeat
3    chk_myset := myset;
4    chk_myset := select (chk_myset, (chk_myset.content.language = myset[i].content.language));
5    if count(^chk_myset) > 1 then
6      foutcode := 'double language';
7    i := i + 1;
8  until (foutcode <> '') OR (i = count(myset));

This code checks the dataset for duplicate values in content.language. On line 3, chk_myset gets an XML value; its structure is empty. The select on line 4 takes the XML of chk_myset, creates the structure, performs the select, and returns XML by converting the structure back. On line 5 that XML is converted to a structure again so that the count can return the number. On line 8, count(myset) creates the structure of myset and returns a number. Execution then goes back to line 3, which sets the XML again and so discards the structure, only for the select to create it again (and convert it back to XML). So in this code, the conversion between XML and structure runs 3 times per iteration.

The function selectonset was added to avoid this. We now copy the dataset once (line 1) and from then on work only with the structure. selectonset uses the structure directly and does not need to convert to XML. We do have to replace the reset that line 3 of the first example performed. After a select, only the iterations that meet the condition are still visible. To set all rows back to visible so that a new select can take place, we reset the structure with resetdatasetonset, which does the same as resetdataset but, again, works directly on the structure. See line 4 of the code below.

1  chk_myset := myset;
2  i := 0;
3  repeat
4    resetdatasetonset(^chk_myset);
5    selectonset (chk_myset, (chk_myset.content.language = myset[i].content.language));
6    if count(^chk_myset) > 1 then
7      foutcode := 'double language';
8    i := i + 1;
9  until (foutcode <> '') OR (i = count(myset));
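The idea behind the second version, selecting and resetting in place on the structure, can be sketched in Python as follows. This is a simplified model, not the server's code: the iterations are plain dictionaries with a visible flag, the function names are hypothetical, and it assumes that a select only hides rows that are currently visible (which is why a reset is needed before each new select). Unlike the listing above, the sketch uses one list for both myset and chk_myset.

```python
def reset_on_set(rows):
    """Make every iteration visible again (models resetdatasetonset)."""
    for row in rows:
        row["visible"] = True


def select_on_set(rows, predicate):
    """Hide visible iterations that do not match (models selectonset)."""
    for row in rows:
        if row["visible"] and not predicate(row):
            row["visible"] = False


def count_visible(rows):
    """Count the iterations that are still visible."""
    return sum(1 for row in rows if row["visible"])


def first_duplicate_language(rows):
    """Return the first language that occurs more than once, or None."""
    for current in rows:
        reset_on_set(rows)
        select_on_set(rows, lambda r: r["language"] == current["language"])
        if count_visible(rows) > 1:
            return current["language"]
    return None


myset = [
    {"language": "en", "visible": True},
    {"language": "nl", "visible": True},
    {"language": "en", "visible": True},
]
duplicate = first_duplicate_language(myset)
print(duplicate)
```

Nothing is serialised or parsed inside the loop; only the visibility flags change, which is the point of working on the structure.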

Purging

As seen above, selecting on a dataset keeps the dataset big: filtered iterations still exist. You can use purgedataset to remove those invisible iterations, making the dataset a lot smaller. You can also purge as part of a select: selectpurge returns the XML without the invisible iterations (a resetdataset afterwards is then useless).

Besides selectpurge there is also selectpurgeonset, which performs the select on the structure and deletes the invisible rows.
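As a rough illustration of the difference between hiding and purging, the following Python sketch models iterations as dictionaries with a visible flag. The function names are hypothetical stand-ins for purgedataset and selectpurgeonset, and the sketch says nothing about how the server renumbers or stores the remaining iterations.

```python
def purge(rows):
    """Drop all invisible iterations in place (models purgedataset)."""
    rows[:] = [row for row in rows if row["visible"]]


def select_purge_on_set(rows, predicate):
    """Keep only visible iterations that match (models selectpurgeonset)."""
    rows[:] = [row for row in rows if row["visible"] and predicate(row)]


rows = [
    {"language": "en", "visible": True},
    {"language": "nl", "visible": False},   # hidden by an earlier select
    {"language": "fr", "visible": True},
]

purge(rows)                       # the hidden "nl" iteration is gone for good
remaining_after_purge = len(rows)

select_purge_on_set(rows, lambda r: r["language"] == "en")
remaining_after_select = len(rows)
print(remaining_after_purge, remaining_after_select)
```

After a purge there is nothing left to make visible again, which is why a reset has no effect on purged data.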