Dataset Generation
HELIX includes a few utilities for generating simple datasets with known ground truth from a library of Blueprints, Components, and Transforms. These datasets fall into two categories:
- Similarity Datasets
Ground truth labels for these datasets consist of labels for individual samples, gathered by aggregating the labels of the included Components. The “similarity” of any two samples can be computed by comparing their lists of labels. These datasets are useful for evaluating program similarity approaches.
- Classification Datasets
Samples in these datasets are assigned to a synthetic “class”, where members of the class will be more similar to other members of the same class and different from members of other classes. These datasets are useful for evaluating classification approaches based on program similarity.
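As a rough illustration of comparing two samples by their label lists, the sketch below uses Jaccard similarity (shared labels divided by all distinct labels). This is only one reasonable choice of measure and is not something HELIX prescribes; the sample names and label strings are hypothetical and do not reflect the actual layout of the generated labels file.

```python
def jaccard_similarity(labels_a, labels_b):
    """Return |A & B| / |A | B| for two label lists (1.0 if both are empty)."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical per-sample labels; names and label strings are illustrative only.
sample_labels = {
    "sample-0": ["minimal-example", "configuration-example"],
    "sample-1": ["configuration-example"],
    "sample-2": ["minimal-example"],
}

similarity = jaccard_similarity(sample_labels["sample-0"], sample_labels["sample-1"])
print(similarity)  # 0.5: one shared label out of two distinct labels
```

A set-based measure ignores label order and duplicates. If repeated Components should count more than once, a multiset comparison (for example with collections.Counter) would be a better fit.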
Generating Similarity Datasets
HELIX provides the aptly named dataset-similarity command for generating similarity datasets using combinations of Components. There are a number of different Component selection strategies available. In the simplest case, the following command uses the simple strategy (where only a single Component is included per sample), writing results to a directory named dataset, for a few different configurations of the configuration-example Component.
helix dataset-similarity simple dataset \
    -c configuration-example:first_word=hello,second_word=world \
    configuration-example:first_word=bonjour,second_word='le monde' \
    configuration-example:first_word=ciao,second_word=mondo \
    configuration-example:first_word=hola,second_word=mundo \
    configuration-example:first_word=hallo,second_word=welt
The generated dataset consists of five samples, one for each included Component configuration. Build output is logged to the sample directories in dataset and dataset labels are written to dataset/labels.json.
The simple strategy is little more than a sanity check. More sophisticated strategies are also supported: random, which randomly selects combinations of the provided Components, and walk, which randomly selects an initial combination of Components, then randomly permutes a small portion of them each time. Supported Transforms can also be applied to all samples in a dataset. For example, the following command generates a dataset using the random strategy with the same Components and configurations as above, plus the minimal-example Component, including three Components per sample and applying the strip Transform to all samples:
helix dataset-similarity random dataset \
    --sample-count 25 \
    --component-count 3 \
    -c minimal-example \
    configuration-example:first_word=hello,second_word=world \
    configuration-example:first_word=bonjour,second_word='le monde' \
    configuration-example:first_word=ciao,second_word=mondo \
    configuration-example:first_word=hola,second_word=mundo \
    configuration-example:first_word=hallo,second_word=welt \
    -t strip
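To make the difference between the random and walk strategies concrete, here is a minimal Python sketch of the two selection ideas. It is not HELIX's implementation: the function names, the swaps parameter, and the reading of "permutes a small portion" as "swaps a few Components for unused ones" are all assumptions made for illustration.

```python
import random


def random_strategy(components, sample_count, component_count, rng):
    """Pick an independent random combination of Components for every sample."""
    return [rng.sample(components, component_count) for _ in range(sample_count)]


def walk_strategy(components, sample_count, component_count, rng, swaps=1):
    """Start from one random combination, then swap `swaps` members per step.

    Assumption: "permuting a small portion" is modeled as replacing a few
    selected Components with ones that are not currently selected.
    """
    if sample_count <= 0:
        return []
    current = rng.sample(components, component_count)
    samples = [list(current)]
    for _ in range(sample_count - 1):
        for _ in range(swaps):
            unused = [c for c in components if c not in current]
            if not unused:
                break  # every Component is already selected; nothing to swap in
            current[rng.randrange(len(current))] = rng.choice(unused)
        samples.append(list(current))
    return samples


# Component specifications taken from the command above, as plain strings.
components = [
    "minimal-example",
    "configuration-example:first_word=hello,second_word=world",
    "configuration-example:first_word=bonjour,second_word=le monde",
    "configuration-example:first_word=ciao,second_word=mondo",
    "configuration-example:first_word=hola,second_word=mundo",
    "configuration-example:first_word=hallo,second_word=welt",
]

rng = random.Random(0)  # seeded so that repeated runs select the same samples
random_samples = random_strategy(components, 25, 3, rng)
walk_samples = walk_strategy(components, 25, 3, rng)
```

Under these assumptions, consecutive walk samples share most of their Components, so a walk dataset contains many near-duplicate pairs, while random samples overlap only by chance.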
Generating Classification Datasets
Coming soon…
External Components
Dataset generation also supports loading Components from external sources and downstream libraries using the helix.component.Loader interface.