batch

Sampling elements with their nearest neighbors from data

MuyGPyS includes convenience functions for sampling batches of data from existing datasets. Each batch is returned as row indices: those of the sampled data and those of their nearest neighbors. The module can also sample “balanced” batches, where the data are partitioned by class and the sampler draws as close to an equal number of items from each class as possible.

MuyGPyS.optimize.batch.full_filtered_batch(nbrs_lookup, labels)

Return a batch composed of the entire training set, filtering out elements whose nearest neighbor sets are constant, i.e. whose neighbors all share one class label.

Parameters:

nbrs_lookup (NN_Wrapper) – Trained nearest neighbor query data structure.
labels (ndarray) – List of class labels of shape (train_count,) for all training data.

Return type:

Tuple[ndarray, ndarray]

Returns:

indices – The indices of the sampled training points of shape (batch_count,).
nn_indices – The indices of the nearest neighbors of the sampled training points of shape (batch_count, nn_count).
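
The filtering step can be illustrated with plain NumPy. The following is a minimal sketch, not the library's implementation: the helper name `filter_constant_neighborhoods` is hypothetical, and it assumes the neighbor indices are already available as an integer array of shape (train_count, nn_count).

```python
import numpy as np


def filter_constant_neighborhoods(nn_indices, labels):
    """Hypothetical helper: keep rows whose neighbors carry more than one label."""
    # Look up the class label of every neighbor: shape (train_count, nn_count).
    nn_labels = labels[nn_indices]
    # A neighborhood is constant when every neighbor matches the first neighbor.
    nonconstant_mask = np.any(nn_labels != nn_labels[:, :1], axis=1)
    indices = np.flatnonzero(nonconstant_mask)
    return indices, nn_indices[indices]


labels = np.array([0, 0, 0, 1, 1, 1])
nn_indices = np.array(
    [
        [1, 2],  # neighbor labels (0, 0): constant, dropped
        [0, 2],  # neighbor labels (0, 0): constant, dropped
        [1, 3],  # neighbor labels (0, 1): mixed, kept
        [2, 4],  # neighbor labels (0, 1): mixed, kept
        [3, 5],  # neighbor labels (1, 1): constant, dropped
        [3, 4],  # neighbor labels (1, 1): constant, dropped
    ]
)
indices, kept_nn_indices = filter_constant_neighborhoods(nn_indices, labels)
print(indices)          # rows 2 and 3 survive
print(kept_nn_indices)  # their neighbor rows
```

Only rows 2 and 3 survive here, because they are the only points whose neighbors span both classes.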

MuyGPyS.optimize.batch.get_balanced_batch(nbrs_lookup, labels, batch_count)

Decide whether to sample a balanced batch or return the full filtered batch.

This is the recommended entry point for sampling from classification datasets when every class should be equally represented. It calls MuyGPyS.optimize.batch.full_filtered_batch() if the supplied list of training class labels is smaller than the batch count, and MuyGPyS.optimize.batch.sample_balanced_batch() otherwise.

Example

>>> import numpy as np
>>> from MuyGPyS.neighbors import NN_Wrapper
>>> from MuyGPyS.optimize.batch import get_balanced_batch
>>> train_features, train_responses = get_train()  # user-supplied data loader
>>> nn_count = 10
>>> nbrs_lookup = NN_Wrapper(train_features, nn_count)
>>> batch_count = 200
>>> # recover integer class labels from the per-class response columns
>>> train_labels = np.argmax(train_responses, axis=1)
>>> balanced_indices, balanced_nn_indices = get_balanced_batch(
...         nbrs_lookup, train_labels, batch_count
... )

Parameters:

nbrs_lookup (NN_Wrapper) – Trained nearest neighbor query data structure.
labels (ndarray) – List of class labels of shape (train_count,) for all training data.
batch_count (int) – The number of batch elements to sample.

Return type:

Tuple[ndarray, ndarray]

Returns:

indices – The indices of the sampled training points of shape (batch_count,).
nn_indices – The indices of the nearest neighbors of the sampled training points of shape (batch_count, nn_count).
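
The dispatch rule described for this function can be written out as a small sketch. The helper `choose_batch_strategy` is hypothetical and only returns the name of the function that would be called. It follows the wording above literally, so the case where the label count equals the batch count falls to the balanced sampler; the library's behavior at exact equality is not confirmed here.

```python
def choose_batch_strategy(label_count, batch_count):
    """Hypothetical helper: name the sampler the documented rule would select."""
    # Fewer labels than requested batch elements: use everything (after filtering).
    if label_count < batch_count:
        return "full_filtered_batch"
    # Otherwise there is enough data to draw a class-balanced subsample.
    return "sample_balanced_batch"


print(choose_batch_strategy(150, 200))
print(choose_batch_strategy(5000, 200))
```

The first call selects the full filtered batch because only 150 labeled points exist; the second selects the balanced sampler.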

MuyGPyS.optimize.batch.sample_balanced_batch(nbrs_lookup, labels, batch_count)

Collect a class-balanced batch of training indices.

The returned batch is filtered to remove samples whose nearest neighbors all share the same class label, and is balanced so that each class is equally represented (where possible).

Parameters:

nbrs_lookup (NN_Wrapper) – Trained nearest neighbor query data structure.
labels (ndarray) – List of class labels of shape (train_count,) for all training data.
batch_count (int) – The number of batch elements to sample.

Return type:

Tuple[ndarray, ndarray]

Returns:

nonconstant_balanced_indices – The indices of the sampled training points of shape (batch_count,). These indices are guaranteed to have nearest neighbors with differing class labels.
batch_nn_indices – The indices of the nearest neighbors of the sampled training points of shape (batch_count, nn_count).
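
The balancing step can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the library's algorithm: the helper `balanced_sample` is hypothetical, it gives each class an equal quota of `batch_count // class_count`, and it omits the neighborhood filtering described above. A class with fewer members than its quota contributes everything it has, which is why the result may fall short of `batch_count`.

```python
import numpy as np


def balanced_sample(labels, batch_count, rng):
    """Hypothetical helper: draw up to an equal quota of indices from each class."""
    classes = np.unique(labels)
    per_class = batch_count // len(classes)
    picks = []
    for c in classes:
        members = np.flatnonzero(labels == c)
        # A small class cannot fill its quota, so take all of its members.
        take = min(per_class, len(members))
        picks.append(rng.choice(members, size=take, replace=False))
    return np.sort(np.concatenate(picks))


rng = np.random.default_rng(0)
labels = np.array([0] * 8 + [1] * 3 + [2] * 8)
batch = balanced_sample(labels, 12, rng)
counts = np.bincount(labels[batch], minlength=3)
print(counts)  # per-class counts in the sampled batch
```

With a quota of four per class, classes 0 and 2 each contribute four indices and class 1 contributes all three of its members.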

MuyGPyS.optimize.batch.sample_batch(nbrs_lookup, batch_count, train_count)

Collect a batch of training indices.

This is a simple sampling method where training examples are selected uniformly at random, without replacement.

Example

>>> from MuyGPyS.neighbors import NN_Wrapper
>>> from MuyGPyS.optimize.batch import sample_batch
>>> train_features, train_responses = get_train()  # user-supplied data loader
>>> train_count, _ = train_features.shape
>>> nn_count = 10
>>> nbrs_lookup = NN_Wrapper(train_features, nn_count)
>>> batch_count = 200
>>> batch_indices, batch_nn_indices = sample_batch(
...         nbrs_lookup, batch_count, train_count
... )

Parameters:

nbrs_lookup (NN_Wrapper) – Trained nearest neighbor query data structure.
batch_count (int) – The number of batch elements to sample.
train_count (int) – The total number of training examples.

Return type:

Tuple[ndarray, ndarray]

Returns:

batch_indices – The indices of the sampled training points of shape (batch_count,).
batch_nn_indices – The indices of the nearest neighbors of the sampled training points of shape (batch_count, nn_count).
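
To show the shapes involved without a trained lookup structure, the sketch below reproduces the idea with NumPy alone. It is not the library's implementation: a brute-force distance computation stands in for `nbrs_lookup`, the toy data are random, and it assumes each point is excluded from its own neighbor set.

```python
import numpy as np

rng = np.random.default_rng(42)
train_count, nn_count, batch_count = 50, 3, 10
train_features = rng.normal(size=(train_count, 2))  # toy data, illustration only

# Uniform sampling without replacement: no index can appear twice.
batch_indices = rng.choice(train_count, size=batch_count, replace=False)

# Brute-force stand-in for the nearest neighbor query:
# squared Euclidean distance from each batch point to every training point.
diffs = train_features[batch_indices, None, :] - train_features[None, :, :]
dists = np.sum(diffs**2, axis=-1)  # shape (batch_count, train_count)

# Column 0 of the argsort is the point itself (distance 0), so skip it.
batch_nn_indices = np.argsort(dists, axis=1)[:, 1 : nn_count + 1]

print(batch_indices.shape)     # (batch_count,)
print(batch_nn_indices.shape)  # (batch_count, nn_count)
```

The brute-force query costs O(batch_count × train_count) distance evaluations, which is fine for a toy example but is exactly the cost a trained nearest neighbor structure is meant to avoid.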