Disclaimer: The content of this page is automatically fetched from the PrefLib-Data repository. The reference file is: FORMAT_SPECIFICATION.md.

Format Specification

This is the full specification of the PrefLib data format. The following should be considered the reference text, and the different implementations of the PrefLib ecosystem should follow this specification.

Datasets

A dataset is a (zipped) folder containing data files, an info.txt file and potentially a metadata.csv file. The name of the folder is typically num - abb where num is the series number of the dataset, and abb its abbreviation.

Dataset info.txt

Every dataset must include an info.txt file. It contains two sections. The first section presents a set of metadata about the dataset, encoded in the format MetadataName: Value. The second section describes the data files of the dataset. Its format follows that of a csv file, with a comma as separator.

Here is an example of the file, taken from the irish dataset.

Name: Irish Election Data

Abbreviation: irish

Tags: Election

Series Number: 00001

Publication Date: 2013-08-17

Description: <p>The Dublin North, West, and Meath data sets contain a complete record of votes...</p>

Required Citations:

Selected Studies: Budgeted Social Choice: From Consensus to Personalized Decision Making; Tyler Lu and Craig Boutilier; Proceedings of IJCAI; 2011

file_name, modification_type, relates_to, title, description, publication_date
00001-00000001.soi, original, , 2002 Dublin North, , 2013-08-17
00001-00000001.toc, imbued, 00001-00000001.soi, 2002 Dublin North, Obtained from the soi by adding the unranked alternatives at the bottom, 2013-08-17
00001-00000002.soi, original, , 2002 Dublin West, , 2013-08-17
00001-00000002.toc, imbued, 00001-00000002.soi, 2002 Dublin West, Obtained from the soi by adding the unranked alternatives at the bottom, 2013-08-17
00001-00000003.soi, original, , 2002 Meath, , 2013-08-17
00001-00000003.toc, imbued, 00001-00000003.soi, 2002 Meath, Obtained from the soi by adding the unranked alternatives at the bottom, 2013-08-17

Let us describe the metadata in more detail:

All these metadata are mandatory. If there is no information for a given metadata field, its value should be left blank.

The second part of the file describes all the data files of the dataset in a csv fashion. We detail the headers in the following.

Among all those headers, relates_to and description can be empty. All the others are required to have a value.

Note that if a comma should appear in a field (e.g., in description), the value of the field should be put into triple double quotes: """this is my description, with a comma.""". This also means that triple double quotes should not be used for any other purpose.

Dataset Tags

Dataset tags are used to classify datasets based on their characteristics. These are the tags currently in use.

This list is not fixed. Extra tags can be added and current tags can be removed.

Data Files

A data file can be any type of file included in a dataset. Several file formats have been defined for the PrefLib ecosystem. These formats should be preferred over other file formats but it is not mandatory to use them.

Data Types

The PrefLib ecosystem defines 6 different data types.

PrefLib File Format

For each of the data types defined above, a corresponding file format has been defined. All the data files share a common file format, with a few adaptations for each specific type.

Data files contain two sections, first a list of metadata with lines starting with a “#”; second the preference data itself.

Metadata Header

We start with an example of a file header.

# FILE NAME: 00001-00000001.soi
# TITLE: 2002 Dublin North
# DESCRIPTION: 
# DATA TYPE: soi
# MODIFICATION TYPE: original
# RELATES TO: 
# RELATED FILES: 00001-00000001.toc
# PUBLICATION DATE: 2013-08-17
# MODIFICATION DATE: 2022-09-16
# NUMBER ALTERNATIVES: 12
# NUMBER VOTERS: 43942
# NUMBER UNIQUE ORDERS: 19299
# ALTERNATIVE NAME 1: Cathal Boland F.G.
# ALTERNATIVE NAME 2: Clare Daly S.P.
# ALTERNATIVE NAME 3: Mick Davis S.F.
# ALTERNATIVE NAME 4: Jim Glennon F.F.
# ALTERNATIVE NAME 5: Ciaran Goulding Non-P
# ALTERNATIVE NAME 6: Michael Kennedy F.F.
# ALTERNATIVE NAME 7: Nora Owen F.G.
# ALTERNATIVE NAME 8: Eamonn Quinn Non-P
# ALTERNATIVE NAME 9: Sean Ryan Lab
# ALTERNATIVE NAME 10: Trevor Sargent G.P.
# ALTERNATIVE NAME 11: David Henry Walshe C.C. Csp
# ALTERNATIVE NAME 12: G.V. Wright F.F.

We now describe each of the metadata of the example header.

Whatever the file format, all these metadata have to be present. Additional metadata that are specific to the file can then be added.
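As an illustration of the two-section layout, here is a minimal Python sketch that separates the metadata header from the preference data. The function name parse_header is hypothetical, and the sketch assumes that every header line has the form # KEY: VALUE, as in the example header.

```python
def parse_header(lines):
    """Split the lines of a data file into metadata, alternative names and data lines."""
    metadata, alternatives, data_lines = {}, {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, _, value = line[1:].partition(":")  # split on the first colon only
            key, value = key.strip(), value.strip()
            if key.startswith("ALTERNATIVE NAME"):
                # The key ends with the number identifying the alternative.
                alternatives[int(key.split()[-1])] = value
            else:
                metadata[key] = value
        else:
            data_lines.append(line)  # everything else is preference data
    return metadata, alternatives, data_lines
```

Values are kept as strings; converting fields such as NUMBER ALTERNATIVES to integers is left to the caller.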

Here are some general formatting rules:

Modification Type

Each data file is labeled as either Original, Induced, Imbued or Synthetic.

File Formats for Ordinal Preferences

The file formats for ordinal preferences are SOC, SOI, TOC, TOI. These four file formats are very similar: the metadata header follows the same specification; only the description of the preferences differs.

Metadata Header

In addition to the metadata described above, the header of files representing ordinal preferences also includes the following metadata.

Preferences

The preferences submitted by the respondents are encoded as described in the following. Each line indicates first the number of voters who submitted the given preference list, and then, after a colon, the preference list. Inside a preference list, a strict ordering is indicated by commas, and indifference classes are grouped within curly brackets. Preferences are transitive.

We provide below two examples of this encoding:

- 1: 1, 4, 3, 2: 1 respondent submitted the following preferences: alternative 1 is preferred to alternative 4, which is preferred to alternative 3, itself preferred to alternative 2.
- 13: 1, {4, 3}, 2: 13 respondents submitted the following preferences: alternative 1 is preferred to alternatives 4 and 3, which are both preferred to alternative 2, but alternatives 4 and 3 are ranked at the same position.
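The encoding can be read back with a few lines of Python. The following is only a sketch: the function name parse_order_line and the choice of representing an order as a list of tuples (one tuple per indifference class) are assumptions made for illustration.

```python
def parse_order_line(line):
    """Parse 'count: order' into (count, list of indifference classes)."""
    count, _, order = line.partition(":")
    ranking = []
    tie = None  # alternatives of the indifference class being read, if any
    for token in order.split(","):
        token = token.strip()
        if token.startswith("{"):
            tie = []  # an opening bracket starts a new indifference class
            token = token[1:]
        closes = token.endswith("}")
        if closes:
            token = token[:-1]
        token = token.strip()
        if tie is None:
            ranking.append((int(token),))  # alternative ranked alone
        else:
            if token:
                tie.append(int(token))
            if closes:
                ranking.append(tuple(tie))
                tie = None
    return int(count), ranking


print(parse_order_line("1: 1, 4, 3, 2"))     # (1, [(1,), (4,), (3,), (2,)])
print(parse_order_line("13: 1, {4, 3}, 2"))  # (13, [(1,), (4, 3), (2,)])
```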

Each file format has specific constraints as to which orders can appear:

It is mandatory that each file uses the most restrictive file format that is compatible with the preferences. So even though an SOC file can also be formatted as a TOI file, the SOC file format should be used.
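A possible way to pick the most restrictive format is sketched below in Python. It assumes the usual reading of the four acronyms (SOC and SOI for strict orders, complete and incomplete respectively; TOC and TOI for orders with ties, complete and incomplete respectively) and that each order is given as a list of tuples, one tuple per indifference class. The function name most_restrictive_format is hypothetical.

```python
def most_restrictive_format(orders, num_alternatives):
    """Return 'soc', 'soi', 'toc' or 'toi' for a collection of orders."""
    # Strict: every indifference class contains exactly one alternative.
    strict = all(len(cls) == 1 for order in orders for cls in order)
    # Complete: every order ranks all the alternatives.
    complete = all(
        sum(len(cls) for cls in order) == num_alternatives for order in orders
    )
    return ("s" if strict else "t") + "o" + ("c" if complete else "i")
```

For instance, a file in which a single order contains a tie can no longer be stored as SOC or SOI, even if all its other orders are strict.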

It is mandatory that no order appears more than once; the multiplicity value of each order is used for that purpose.
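When orders are collected one ballot at a time, duplicates can be merged by summing their multiplicities. The Python sketch below shows one way to do it; the function name merge_duplicates is hypothetical, orders are assumed to be lists of tuples (one tuple per indifference class), and sorting by decreasing multiplicity is a choice of this sketch rather than a stated requirement.

```python
from collections import Counter


def merge_duplicates(entries):
    """Merge (count, order) pairs so that each order appears exactly once."""
    totals = Counter()
    for count, order in entries:
        totals[tuple(order)] += count  # tuples are hashable, lists are not
    # most_common() yields the orders by decreasing total multiplicity.
    return [(count, list(order)) for order, count in totals.most_common()]
```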

To conclude, here is an example of the first lines of a data file of complete orders with ties (TOC) (taken from the debian election dataset).

# FILE NAME: 00002-00000001.toc
# TITLE: Debian 2002 Leader
# DESCRIPTION: Obtained from the soi by adding the unranked alternatives at the bottom
# DATA TYPE: toc
# MODIFICATION TYPE: imbued
# RELATES TO: 00002-00000001.soi
# RELATED FILES:
# PUBLICATION DATE: 2013-08-17
# MODIFICATION DATE: 2022-09-16
# NUMBER ALTERNATIVES: 4
# NUMBER VOTERS: 475
# NUMBER UNIQUE ORDERS: 31
# ALTERNATIVE NAME 1: Branden Robinson
# ALTERNATIVE NAME 2: Raphael Hertzog
# ALTERNATIVE NAME 3: Bdale Garbee
# ALTERNATIVE NAME 4: None Of The Above
100: 3,1,2,4
79: 1,3,2,4
54: 3,2,1,4
43: 2,3,1,4
34: 3,2,4,1
30: 1,2,3,4
29: 2,1,3,4
16: 1,3,4,2
14: 2,3,4,1
12: 3,1,4,2
9: 3,{1,2,4}

File Format for Categorical Preferences

The file format for categorical preferences is CAT.

Metadata Header

In addition to the metadata described above, the header of files representing categorical preferences also includes the following metadata.

Preferences

The preferences submitted by the respondents are encoded as described in the following. Each line indicates first the number of voters who submitted the given preference list, and then, after a colon, the preference list. Inside a preference list, each category is grouped within curly brackets, except for the categories with a single alternative. The empty category is represented as “{}”.

We provide below two examples of this encoding:

It is mandatory that no preference appears more than once; the multiplicity value of each preference is used for that purpose.

To conclude, here is an example of the first lines of a CAT file from the French Approval dataset.

# FILE NAME: 00026-00000001.cat
# TITLE: GylesNonains
# DESCRIPTION:
# DATA TYPE: cat
# MODIFICATION TYPE: original
# RELATES TO:
# RELATED FILES: 00026-00000001.toc
# PUBLICATION DATE: 2017-04-13
# MODIFICATION DATE: 2022-09-16
# NUMBER ALTERNATIVES: 16
# NUMBER VOTERS: 365
# NUMBER UNIQUE PREFERENCES: 216
# NUMBER CATEGORIES: 2
# CATEGORY NAME 1: Yes
# CATEGORY NAME 2: No
# ALTERNATIVE NAME 1: Megret
# ALTERNATIVE NAME 2: Lepage
# ALTERNATIVE NAME 3: Gluckstein
...
13: 6,{1,2,3,4,5,7,8,9,10,11,12,13,14,15,16}
13: {},{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}
10: {9,10},{1,2,3,4,5,6,7,8,11,12,13,14,15,16}
10: {1,6},{2,3,4,5,7,8,9,10,11,12,13,14,15,16}
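Lines such as the ones above can be parsed with a short Python sketch. The function name parse_cat_line is hypothetical, and the sketch assumes that the position of a group in the line corresponds to the category number of the header (the first group for category 1, the second for category 2, and so on).

```python
import re


def parse_cat_line(line):
    """Parse 'count: categories' into (count, list of categories)."""
    count, _, prefs = line.partition(":")
    categories = []
    # Each match is either a bracketed group or a single alternative number.
    for braced, single in re.findall(r"\{([^}]*)\}|(\d+)", prefs):
        if single:
            categories.append((int(single),))
        else:
            # An empty bracket pair yields the empty category ().
            categories.append(tuple(int(x) for x in braced.split(",") if x.strip()))
    return int(count), categories


print(parse_cat_line("13: {},{1,2,3}"))  # (13, [(), (1, 2, 3)])
```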

File Format for Weighted Matching

The file format for weighted matching preferences is WMD.

Metadata Header

In addition to the metadata described above, the header of files representing weighted matching preferences also includes the following metadata.

Preferences

The preferences submitted by the respondents are encoded as described in the following. The preferences are viewed as a matching graph. The matching graph is described as a list of Source, Destination, Weight.

Here is an example of the first lines of a WMD file from the Kidney dataset.

# FILE NAME: 00036-00000001.wmd
# TITLE: Kidney Matching - 16 with 0
# DESCRIPTION:
# DATA TYPE: wmd
# MODIFICATION TYPE: synthetic
# RELATES TO:
# RELATED FILES: 00036-00000001.dat
# PUBLICATION DATE: 2017-04-13
# MODIFICATION DATE: 2022-09-16
# NUMBER ALTERNATIVES: 16
# NUMBER VOTERS: 365
# NUMBER EDGES: 59
# ALTERNATIVE NAME 1: Pair 1
# ALTERNATIVE NAME 2: Pair 2
# ALTERNATIVE NAME 3: Pair 3
...
1,5,1.0
1,6,1.0
2,1,1.0
2,3,1.0
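The edge list can be loaded with a minimal Python sketch such as the one below. The function name parse_wmd_edges is hypothetical; the sketch assumes that sources and destinations are the integer identifiers of the alternatives and that weights are decimal numbers, as in the example.

```python
def parse_wmd_edges(lines):
    """Return the edges of a WMD file as (source, destination, weight) tuples."""
    edges = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the metadata header
        source, destination, weight = line.split(",")
        edges.append((int(source), int(destination), float(weight)))
    return edges
```

The length of the returned list can then be compared with the NUMBER EDGES value of the header as a simple consistency check.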

Extra Data File

When miscellaneous data are needed, we use the file extension DAT which has no specified format. CSV files are also sometimes used.