Emping is a utility that derives heuristic rules from nominal data. Nominal data are qualitative and unordered, as in:
Class is actually an ordinal attribute, but when the order is disregarded, it is nominal.
Heuristic rules consist of attribute values (predicates) that together imply another attribute value. For example:
Heuristic rules are purely empirical, with no foundation in a theory or model. The input of Emping is just a table of nominal facts. The user selects which attribute is to be the consequent. Emping then derives all shortest rules which, in the table, imply the values of the selected consequent. Each reduced rule is a generalization of one or more original rules, and therefore reduced rules may imply other reduced rules or be equivalent to them. Where this is the case, these logical dependencies are derived as well.
Emping reads a file in comma separated format (.csv) as produced by the Open Office Calc spreadsheet, and returns the results as .csv files that can be read by OO Calc.
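To make the idea of a "shortest rule" concrete, here is a minimal brute-force sketch in Python. It is not Emping's algorithm (that is described in the white paper, and Emping itself is written in Haskell); the table, the attribute names and the function `shortest_rules` are all made up for illustration. For every fact it looks for the smallest sets of attribute values that, within the table, always go together with that fact's consequent value.

```python
from itertools import combinations

def shortest_rules(rows, attributes, consequent):
    """Brute force: for each fact, find the minimal antecedents that
    imply its consequent value everywhere in the table."""
    antecedent_attrs = [a for a in attributes if a != consequent]
    rules = set()
    for row in rows:
        found = []  # minimal antecedents found so far for this fact
        for size in range(1, len(antecedent_attrs) + 1):
            for combo in combinations(antecedent_attrs, size):
                predicates = frozenset((a, row[a]) for a in combo)
                # A superset of an antecedent that already works is not shortest.
                if any(shorter <= predicates for shorter in found):
                    continue
                matching = [r for r in rows
                            if all(r[a] == v for a, v in predicates)]
                if all(r[consequent] == row[consequent] for r in matching):
                    found.append(predicates)
        rules.update((p, row[consequent]) for p in found)
    return rules

# Hypothetical table of nominal facts (not taken from the Emping distribution).
attributes = ["Weather", "Wind", "Go"]
rows = [
    {"Weather": "sunny", "Wind": "calm",  "Go": "yes"},
    {"Weather": "sunny", "Wind": "windy", "Go": "no"},
    {"Weather": "rainy", "Wind": "calm",  "Go": "no"},
    {"Weather": "rainy", "Wind": "windy", "Go": "no"},
]
rules = shortest_rules(rows, attributes, "Go")
for predicates, value in sorted(rules, key=lambda r: sorted(r[0])):
    print(" and ".join(f"{a}={v}" for a, v in sorted(predicates)), "=> Go =", value)
```

The search is exponential in the number of attributes, so it is only meant to show what is being computed, not how to compute it on data sets the size of the Mushroom example.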
Emping 0.4 has a GUI and you can start it like any other application.
The general idea is straightforward; see the example for details. First, in the options menu, check or uncheck whether the data should be checked for duplicates. Duplicates have no effect on the result but slow the program down; if the option is selected they are removed automatically (though not from the source file itself). If there are duplicates, you can save them to see which facts were duplicated and how often.
Next you select the consequent attribute from a popup menu. If you have the relevant options menu item checked, Emping will look for ambiguous rules for the selected consequent. These are rules which are the same, except for the consequent value. Emping works seamlessly with ambiguous rules (but see the included white paper "Deriving Heuristic Rules from Facts" for how they affect the results).
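The duplicate check described above can be pictured with a few lines of Python. This is only a sketch of the idea, not Emping's code; the function name `split_duplicates` and the sample facts are invented for the illustration.

```python
from collections import Counter

def split_duplicates(rows):
    """Return the facts without repeats, plus the repeated facts with
    their frequencies. Rows are tuples of attribute values."""
    counts = Counter(rows)
    unique = list(counts)  # keeps first-occurrence order (Python 3.7+)
    duplicates = {row: n for row, n in counts.items() if n > 1}
    return unique, duplicates

# Hypothetical facts: the first and third rows are identical.
facts = [
    ("sunny", "calm", "yes"),
    ("rainy", "windy", "no"),
    ("sunny", "calm", "yes"),
]
unique, duplicates = split_duplicates(facts)
print(unique)
print(duplicates)
```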
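As an illustration of what "ambiguous" means here, the following Python sketch groups facts by everything except the consequent and reports the groups that carry more than one consequent value. The function `find_ambiguous` and the data are hypothetical and say nothing about how Emping implements the check.

```python
from collections import defaultdict

def find_ambiguous(rows, consequent_index):
    """Map each antecedent that occurs with several consequent values
    to the set of those values. Rows are tuples of attribute values."""
    groups = defaultdict(set)
    for row in rows:
        antecedent = row[:consequent_index] + row[consequent_index + 1:]
        groups[antecedent].add(row[consequent_index])
    return {ante: values for ante, values in groups.items() if len(values) > 1}

# Hypothetical facts; the consequent is the last column (index 2).
facts = [
    ("sunny", "calm", "yes"),
    ("sunny", "calm", "no"),   # same antecedent, different consequent
    ("rainy", "windy", "no"),
]
ambiguous = find_ambiguous(facts, 2)
print(ambiguous)
```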
Pressing the Reduce button will then start the reduction. Depending on the data set, this may take a while, during which time the file save window will remain open. For small data sets the effect will not be noticeable, but do not try to close it unless you want to abort the reduction.
Reduced rules may themselves imply, or be equivalent to, other reduced rules with the same consequent value. The default Top button will show only the top level of these rules (saved in a .csv file) but you can get all interdependencies through the Abduce All menu item.
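One way to picture these dependencies is sketched below in Python. It rests on an assumption that should be checked against the white paper: a reduced rule is taken to imply another rule with the same consequent value when, in the table, every fact that satisfies the second rule's antecedent also satisfies the first (the more general rule implies the more specific one), and two rules are taken to be equivalent when they are satisfied by exactly the same facts. The names `cover` and `top_level`, the rule labels and the table are all invented for the example.

```python
def cover(antecedent, rows):
    """Indices of the facts that satisfy every predicate of the antecedent."""
    return frozenset(
        i for i, row in enumerate(rows)
        if all(row[a] == v for a, v in antecedent.items())
    )

def top_level(rules, rows):
    """Return the rules not implied by a strictly more general rule,
    and the pairs of rules that are equivalent in the table."""
    covers = {name: cover(ante, rows) for name, ante in rules.items()}
    top = [n for n in rules
           if not any(covers[n] < covers[m] for m in rules)]
    equals = [(a, b) for a in rules for b in rules
              if a < b and covers[a] == covers[b]]
    return top, equals

# Hypothetical facts and reduced rules, all with consequent Go = "no".
rows = [
    {"Weather": "sunny", "Sky": "clear", "Wind": "calm",  "Tide": "high", "Go": "yes"},
    {"Weather": "sunny", "Sky": "clear", "Wind": "windy", "Tide": "low",  "Go": "no"},
    {"Weather": "rainy", "Sky": "grey",  "Wind": "windy", "Tide": "low",  "Go": "no"},
    {"Weather": "rainy", "Sky": "grey",  "Wind": "calm",  "Tide": "low",  "Go": "no"},
]
rules = {
    "r1": {"Tide": "low"},
    "r2": {"Weather": "rainy"},
    "r3": {"Sky": "grey"},
    "r4": {"Wind": "windy"},
}
top, equals = top_level(rules, rows)
print("top level:", top)
print("equals:", equals)
```

In this toy table the rule on Tide covers every "no" fact, so under the stated assumption it is the only top-level rule, while the rules on Weather and Sky cover the same two facts and are reported as equal.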
The reduced normal form file, the top level rule file and the others (if present), can now be loaded into OO Calc.
More about the principles on which Emping is based can be found in the white paper mentioned above, "Deriving Heuristic Rules from Facts". The general idea is also illustrated by the Fishing example that follows.
Enter the data in Open Office Calc as shown:
Save the spreadsheet table in Text CSV format. Choose double quotes as the text delimiter (default). Whole numbers will be stored without delimiters, and emping will use them after checking if they are all digits (no negatives, no fractions). Names within quotes should not contain special characters, only letters, possibly numbers and white space.
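If the table is produced by a script rather than typed into the spreadsheet, the same conventions can be followed with Python's standard `csv` module. The sketch below writes text fields in double quotes and whole numbers bare, then checks a file's fields against the restrictions just described. The helper names and the sample table are hypothetical, and the checks are a reading of the paragraph above rather than a copy of Emping's parser.

```python
import csv
import io

def write_table(rows):
    """Write rows as CSV text: strings in double quotes, numbers bare."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

def is_whole_number(field):
    """Digits only: no sign, no decimal point."""
    return field.isascii() and field.isdigit()

def is_plain_name(field):
    """Letters, digits and white space only."""
    return bool(field) and all(ch.isalnum() or ch.isspace() for ch in field)

def invalid_fields(text):
    """List (line number, field) for every field that breaks the rules."""
    bad = []
    for line_no, record in enumerate(csv.reader(io.StringIO(text)), start=1):
        for field in record:
            if not (is_whole_number(field) or is_plain_name(field)):
                bad.append((line_no, field))
    return bad

# Hypothetical table with a header line and one whole-number column.
table = [["Weather", "Wind", "Catch"], ["sunny", "calm", 3], ["rainy", "windy", 0]]
text = write_table(table)
print(text)
print(invalid_fields(text))
print(invalid_fields('"sunny","-2"\n"a/b","1.5"\n'))
```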
Just double click on the emping file or type the program name on the command line. Then open the source file and apply the toolbar sequence.
This is what you see when the reduction is finished with "Class" as the consequent.
There are 165 dependency trees, and 5 rules which are not implied by others (except for possible equivalences, which will be shown in the .csv file).
You can now choose to save only the top level, all dependencies, or both.
View the reduced normal form in a .csv file in OO Calc.
View the top level reduced rules, including equivalences, in your saved file. Equivalences, shown as 'equals', are reduced rules which imply each other, both ways.
If you have chosen to save them, all logical implications are shown in that file.
This file shows the partial order of all the reduced rules. Each inference chain, including any equivalences, is shown separately. In this case the number of lines was too large to fit into the spreadsheet.
The distribution comes with two example data files, Zoo and Mushrooms, both courtesy of the UCI Machine Learning Repository. Thanks to: Asuncion, A. & Newman, D.J. (2007). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.
The Zoo data file is by courtesy of Richard Forsyth.
It lists 101 kinds of animals with 18 nominal attributes
(including the names). The original data have been altered by replacing the two
instances of "frog" by "frog1" and "frog2",
respectively, and "girl" by "giri". All numerical
boolean instances in the original have been replaced by
"Yes" and "No".
The number of reduced rules with "Type" as the consequent attribute is 823.
These fit into 9 top level equivalence classes. On a 2GHz machine the reduction
took seconds to finish, but the time depends on the structure in the data, and is different for
different attributes.
The Mushroom data file is by courtesy of Jeff Schlimmer, who lists its origin as
drawn from The Audubon Society Field Guide to North American Mushrooms (1981).
G. H. Lincoff (Pres.), New York: Alfred A. Knopf.
The example consists of 8124 cases and 23 attributes, including "Class", which has values
"poisonous" and "edible".
The original file is coded by letters, according to a supplied legend, but this coding has
been reverted to the original names for readability. The missing value mark "?" has been replaced
by "missing".
The number of reduced rules for "Class" is 3635, and there are 165 dependency trees and 5 unconnected rules.
The number of all implications and equivalences was more than 65535, too many for OO Calc to load.
Parsing the source data and checking for duplicates (none present) took half a minute; checking
for ambiguities (none present) also took half a minute. The subsequent reduction (consequent "Class")
took 9 minutes, and getting the top level then took a few seconds. Getting all dependencies,
which resulted in a 150 MB file, took 4 minutes more.
The Emping utility is written in Haskell, and has been
developed and tested on the Fedora Core Linux platform, using
the Haskell tools and libraries which are available as FC packages.
Version 0.4 has been developed on FC8 with GHC 6.8.2 and Gtk2Hs 0.9.12.
Earlier versions of GHC will probably not compile it, and Gtk2Hs versions earlier than 0.9.12
don't support the implemented menu fields, and will not work.
GHC and Gtk2Hs are, however, available on many platforms, including Windows and other Linux versions, so Emping-0.4 should also work on those platforms.
Any comments, bug reports, feature requests or remarks will be most welcome.
Emping stands for empirical reasoning, or for the Indonesian snack of that name.
For large data files like the Mushroom collection, which take a noticeable time to process, windows will remain open and the value selection popup may hang. This will correct itself when the processing is finished. Do not abort until you are sure something is wrong.