package nx-datasets
Module Nx_datasets
Dataset loading and generation utilities for Nx.
This module provides functions to load common machine learning datasets and generate synthetic datasets for testing and experimentation. Real datasets are downloaded and cached in the platform-specific cache directory.
Loading Real Datasets
Functions to load classic machine learning datasets as Nx tensors.
Image Datasets
load_mnist ()
loads MNIST handwritten digits dataset.
Returns training and test sets with images as uint8 tensors of shape [|samples; 28; 28; 1|] and labels as uint8 tensors of shape [|samples; 1|]. The training set has 60,000 samples; the test set has 10,000 samples.
Loading MNIST and checking shapes:
let (x_train, y_train), (x_test, y_test) = Nx_datasets.load_mnist () in
Nx.shape x_train = [| 60000; 28; 28; 1 |]
&& Nx.shape y_train = [| 60000; 1 |]
&& Nx.shape x_test = [| 10000; 28; 28; 1 |]
&& Nx.shape y_test = [| 10000; 1 |]
load_fashion_mnist ()
loads Fashion-MNIST clothing dataset.
Returns the same format as MNIST: images as uint8 tensors of shape [|samples; 28; 28; 1|] and labels as uint8 tensors of shape [|samples; 1|].
load_cifar10 ()
loads CIFAR-10 color image dataset.
Returns training and test sets with images as uint8 tensors of shape [|samples; 32; 32; 3|] and labels as uint8 tensors of shape [|samples; 1|]. The training set has 50,000 samples; the test set has 10,000 samples.
Tabular Datasets
load_iris ()
loads Iris flower classification dataset.
Returns features as a float64 tensor of shape [|150; 4|] and labels as an int32 tensor of shape [|150; 1|]. Features are sepal length, sepal width, petal length, and petal width. Labels are 0 (setosa), 1 (versicolor), 2 (virginica).
load_breast_cancer ()
loads Breast Cancer Wisconsin dataset.
Returns features as a float64 tensor of shape [|569; 30|] and labels as an int32 tensor of shape [|569; 1|]. Labels are 0 (malignant) or 1 (benign).
load_diabetes ()
loads diabetes regression dataset.
Returns features as a float64 tensor of shape [|442; 10|] and targets as a float64 tensor of shape [|442; 1|]. The target is a quantitative measure of disease progression one year after baseline.
load_california_housing ()
loads California housing prices dataset.
Returns features as a float64 tensor of shape [|20640; 8|] and targets as a float64 tensor of shape [|20640; 1|]. The target is the median house value in hundreds of thousands of dollars.
Time Series Datasets
load_airline_passengers ()
loads monthly airline passenger counts.
Returns an int32 tensor of shape [|144|] containing monthly passenger totals from 1949 to 1960.
Generating Synthetic Datasets
Functions to generate synthetic datasets with controlled properties for algorithm development and testing.
Classification Dataset Generators
val make_blobs :
?n_samples:int ->
?n_features:int ->
?centers:[ `N of int | `Array of Nx.float32_t ] ->
?cluster_std:float ->
?center_box:(float * float) ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_blobs ?n_samples ?n_features ?centers ?cluster_std ?center_box ?shuffle ?random_state ()
generates isotropic Gaussian blobs.
Creates clusters of points with each cluster drawn from a normal distribution. Returns features and integer labels.
Generating 3 well-separated 2D clusters:
let x, y = Nx_datasets.make_blobs ~centers:(`N 3) ~cluster_std:0.5 () in
Nx.shape x = [| 100; 2 |] && Nx.shape y = [| 100 |]
val make_classification :
?n_samples:int ->
?n_features:int ->
?n_informative:int ->
?n_redundant:int ->
?n_repeated:int ->
?n_classes:int ->
?n_clusters_per_class:int ->
?weights:float list ->
?flip_y:float ->
?class_sep:float ->
?hypercube:bool ->
?shift:float ->
?scale:float ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_classification ?n_samples ?n_features ?n_informative ...
generates random n-class classification problem.
Creates a dataset with controllable characteristics including informative, redundant, and useless features. Useful for testing feature selection.
Creating a binary classification dataset:
let x, y =
  Nx_datasets.make_classification ~n_features:10 ~n_informative:3
    ~n_redundant:1 ()
in
Nx.shape x = [| 100; 10 |] && Nx.shape y = [| 100 |]
val make_gaussian_quantiles :
?mean:float array ->
?cov:float ->
?n_samples:int ->
?n_features:int ->
?n_classes:int ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_gaussian_quantiles ?mean ?cov ?n_samples ...
generates isotropic Gaussian divided by quantiles.
Divides a single Gaussian cluster into near-equal-size classes separated by concentric hyperspheres. Useful for testing algorithms that assume Gaussian distributions.
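The idea of splitting one Gaussian cloud into concentric shells can be sketched with the standard library alone. This is an illustration under stated assumptions, not the library's implementation: `quantile_labels` is a hypothetical helper, the cloud is assumed to be centered at the origin, and classes are assigned by the rank of each point's squared distance from that center.

```ocaml
(* Hypothetical helper, not part of Nx_datasets: label each point by the rank
   of its squared distance from the origin, so that classes form concentric
   shells of near-equal size. *)
let quantile_labels ~(n_classes : int) (points : float array array) : int array =
  let n = Array.length points in
  let sq_norm p = Array.fold_left (fun acc v -> acc +. (v *. v)) 0.0 p in
  (* Indices sorted by increasing distance from the origin. *)
  let order = Array.init n (fun i -> i) in
  Array.sort (fun i j -> compare (sq_norm points.(i)) (sq_norm points.(j))) order;
  let labels = Array.make n 0 in
  (* The point at position [rank] falls into shell [rank * n_classes / n]. *)
  Array.iteri (fun rank i -> labels.(i) <- rank * n_classes / n) order;
  labels

let example_labels =
  quantile_labels ~n_classes:2
    [| [| 3.0; 0.0 |]; [| 0.1; 0.0 |]; [| 0.0; 1.0 |]; [| 2.0; 2.0 |] |]

let () =
  Array.iter (Printf.printf "%d ") example_labels;
  print_newline ()
```

The two points closest to the origin land in class 0 and the two farthest in class 1, which is the near-equal-size split the description refers to.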
val make_hastie_10_2 :
?n_samples:int ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_hastie_10_2 ?n_samples ?random_state ()
generates Hastie et al. 2009 binary problem.
Generates 10-dimensional dataset where y = 1 if sum(x_i^2) > 9.34 else 0. Standard benchmark for binary classification.
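The labeling rule stated above is easy to check by hand. The following sketch applies it to a single point using only the standard library; `hastie_label` is a hypothetical name introduced for illustration and is not part of Nx_datasets.

```ocaml
(* Label rule from the description above: 1 if the sum of squared coordinates
   exceeds 9.34, otherwise 0. Hypothetical helper, not the library's code. *)
let hastie_label (x : float array) : int =
  let sum_sq = Array.fold_left (fun acc v -> acc +. (v *. v)) 0.0 x in
  if sum_sq > 9.34 then 1 else 0

let () =
  let origin = Array.make 10 0.0 in (* sum of squares = 0 *)
  let ones = Array.make 10 1.0 in   (* sum of squares = 10 *)
  Printf.printf "%d %d\n" (hastie_label origin) (hastie_label ones)
```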
val make_circles :
?n_samples:int ->
?shuffle:bool ->
?noise:float ->
?random_state:int ->
?factor:float ->
unit ->
Nx.float32_t * Nx.int32_t
make_circles ?n_samples ?shuffle ?noise ?random_state ?factor ()
generates concentric circles.
Creates a large circle containing a smaller circle in 2D. Tests algorithms' ability to learn non-linear boundaries.
Creating noisy concentric circles:
let x, y = Nx_datasets.make_circles ~noise:0.1 ~factor:0.5 () in
Nx.shape x = [| 100; 2 |]
&& Array.for_all (fun v -> v = 0 || v = 1) (Nx.to_array y)
val make_moons :
?n_samples:int ->
?shuffle:bool ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t
make_moons ?n_samples ?shuffle ?noise ?random_state ()
generates two interleaving half circles.
Creates two half-moon shapes. Tests algorithms' ability to handle non-convex clusters.
Multilabel Classification
val make_multilabel_classification :
?n_samples:int ->
?n_features:int ->
?n_classes:int ->
?n_labels:int ->
?length:int ->
?allow_unlabeled:bool ->
?sparse:bool ->
?return_indicator:bool ->
?return_distributions:bool ->
?random_state:int ->
unit ->
Nx.float32_t * [ `Float of Nx.float32_t | `Int of Nx.int32_t ]
make_multilabel_classification ?n_samples ?n_features ...
generates random multilabel problem.
Creates samples with multiple labels per instance. Models bag-of-words with multiple topics per document.
Returns (X, Y) where the type of Y depends on return_indicator:
- false: `Int with shape [|n_samples; n_labels|] containing label indices
- true: `Float with shape [|n_samples; n_classes|] containing binary indicators
Regression Dataset Generators
val make_regression :
?n_samples:int ->
?n_features:int ->
?n_informative:int ->
?n_targets:int ->
?bias:float ->
?effective_rank:int option ->
?tail_strength:float ->
?noise:float ->
?shuffle:bool ->
?coef:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t * Nx.float32_t option
make_regression ?n_samples ?n_features ...
generates random regression problem.
Creates linear combination of random features with optional noise and low-rank structure.
Creating multi-output regression:
let x, y, coef =
  Nx_datasets.make_regression ~n_features:20 ~n_informative:5 ~n_targets:2
    ~coef:true ()
in
Nx.shape x = [| 100; 20 |]
&& Nx.shape y = [| 100; 2 |]
&& match coef with Some c -> Nx.shape c = [| 20; 2 |] | None -> false
make_sparse_uncorrelated ?n_samples ?n_features ?random_state ()
generates sparse uncorrelated design.
Only first 4 features affect target: y = x0 + 2*x1 - 2*x2 - 1.5*x3
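As a quick check of the target formula, the sketch below evaluates it for one feature row with plain OCaml floats. `sparse_uncorrelated_target` is a hypothetical helper, and the formula is taken as written above with no noise term.

```ocaml
(* Target formula from the description: only x0..x3 contribute.
   Hypothetical helper for illustration; expects at least 4 features. *)
let sparse_uncorrelated_target (x : float array) : float =
  x.(0) +. (2.0 *. x.(1)) -. (2.0 *. x.(2)) -. (1.5 *. x.(3))

let () =
  (* The fifth feature is ignored by the formula. *)
  let a = sparse_uncorrelated_target [| 1.0; 1.0; 1.0; 1.0; 0.0 |] in
  let b = sparse_uncorrelated_target [| 1.0; 1.0; 1.0; 1.0; 100.0 |] in
  Printf.printf "%g %g\n" a b
```

Both calls give the same value, since changing any feature beyond the first four leaves the target unchanged.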
val make_friedman1 :
?n_samples:int ->
?n_features:int ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t
make_friedman1 ?n_samples ?n_features ?noise ?random_state ()
generates Friedman #1 problem.
Features are uniformly distributed on [0, 1]. Output: y = 10 * sin(pi * x0 * x1) + 20 * (x2 - 0.5)^2 + 10 * x3 + 5 * x4 + noise
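The Friedman #1 output can be evaluated for a single row as a sanity check. In this sketch `friedman1` is a hypothetical helper, and `noise` is passed in as a plain number rather than drawn at random, which is a simplification of whatever the library does internally.

```ocaml
(* Friedman #1 formula as stated above, for one feature row.
   Hypothetical helper; expects at least 5 features in [0, 1]. *)
let friedman1 ?(noise = 0.0) (x : float array) : float =
  let pi = 4.0 *. atan 1.0 in
  (10.0 *. sin (pi *. x.(0) *. x.(1)))
  +. (20.0 *. ((x.(2) -. 0.5) ** 2.0))
  +. (10.0 *. x.(3))
  +. (5.0 *. x.(4))
  +. noise

let () =
  (* sin(pi * 0.5 * 1.0) = 1, and the remaining terms vanish, so y is about 10. *)
  Printf.printf "%.6f\n" (friedman1 [| 0.5; 1.0; 0.5; 0.0; 0.0 |])
```

Any features beyond the first five do not enter the formula.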
val make_friedman2 :
?n_samples:int ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t
make_friedman2 ?n_samples ?noise ?random_state ()
generates Friedman #2 problem.
Four features with ranges: x0 in [0, 100], x1 in [40, 560], x2 in [0, 1], x3 in [1, 11]. Output: y = sqrt(x0^2 + (x1 * x2 - 1/(x1 * x3))^2) + noise
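A minimal sketch of the Friedman #2 output for one row, using only the standard library. `friedman2` is a hypothetical helper and `noise` is supplied by the caller instead of being sampled.

```ocaml
(* Friedman #2 formula as stated above, for one row of four features.
   Hypothetical helper, not the library's implementation. *)
let friedman2 ?(noise = 0.0) (x : float array) : float =
  let inner = (x.(1) *. x.(2)) -. (1.0 /. (x.(1) *. x.(3))) in
  sqrt ((x.(0) *. x.(0)) +. (inner *. inner)) +. noise

let () =
  (* With x0 = 0 the output reduces to |x1 * x2 - 1 / (x1 * x3)|. *)
  Printf.printf "%.4f\n" (friedman2 [| 0.0; 100.0; 0.5; 1.0 |])
```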
val make_friedman3 :
?n_samples:int ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t
make_friedman3 ?n_samples ?noise ?random_state ()
generates Friedman #3 problem.
Four features with same ranges as Friedman #2. Output: y = arctan((x1 * x2 - 1/(x1 * x3)) / x0) + noise
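The Friedman #3 output uses the same inner term inside an arctangent. The sketch below evaluates it for one row; `friedman3` is a hypothetical helper, `noise` is caller-supplied, and x0 is assumed to be non-zero.

```ocaml
(* Friedman #3 formula as stated above, for one row of four features.
   Hypothetical helper; assumes x.(0) <> 0. *)
let friedman3 ?(noise = 0.0) (x : float array) : float =
  let inner = (x.(1) *. x.(2)) -. (1.0 /. (x.(1) *. x.(3))) in
  atan (inner /. x.(0)) +. noise

let () =
  (* Here the ratio inside atan is about 1, so y is close to pi / 4. *)
  Printf.printf "%.6f\n" (friedman3 [| 49.99; 100.0; 0.5; 1.0 |])
```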
Manifold Learning Generators
val make_s_curve :
?n_samples:int ->
?noise:float ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t
make_s_curve ?n_samples ?noise ?random_state ()
generates S-curve dataset.
Creates 3D S-shaped manifold. Returns points and their position along curve.
Returns (X, t) where X has shape [|n_samples; 3|] and t has shape [|n_samples|].
val make_swiss_roll :
?n_samples:int ->
?noise:float ->
?random_state:int ->
?hole:bool ->
unit ->
Nx.float32_t * Nx.float32_t
make_swiss_roll ?n_samples ?noise ?random_state ?hole ()
generates swiss roll dataset.
Creates 3D swiss roll manifold. Returns points and their position along roll.
Returns (X, t) where X has shape [|n_samples; 3|] and t has shape [|n_samples|].
Matrix Decomposition Generators
val make_low_rank_matrix :
?n_samples:int ->
?n_features:int ->
?effective_rank:int ->
?tail_strength:float ->
?random_state:int ->
unit ->
Nx.float32_t
make_low_rank_matrix ?n_samples ?n_features ?effective_rank ...
generates mostly low-rank matrix.
Creates matrix with bell-shaped singular value profile.
val make_sparse_coded_signal :
n_samples:int ->
n_components:int ->
n_features:int ->
n_nonzero_coefs:int ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.float32_t * Nx.float32_t
make_sparse_coded_signal ~n_samples ~n_components ~n_features ~n_nonzero_coefs ?random_state ()
generates sparse signal.
Creates signal Y = D * X where D is dictionary and X is sparse code.
Returns (Y, D, X) where:
- Y has shape [|n_features; n_samples|] (encoded signal)
- D has shape [|n_features; n_components|] (dictionary)
- X has shape [|n_components; n_samples|] (sparse codes)
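The shape relationship Y = D * X can be illustrated with a small hand-written example. This sketch uses plain float arrays rather than Nx tensors; `matmul` and the values of `d` and `x` are made up for illustration, with n_features = 2, n_components = 3, n_samples = 2, and one non-zero coefficient per sample.

```ocaml
(* Naive dense matrix product on float arrays. Hypothetical helper. *)
let matmul (a : float array array) (b : float array array) : float array array =
  let n = Array.length a and k = Array.length b and m = Array.length b.(0) in
  Array.init n (fun i ->
      Array.init m (fun j ->
          let acc = ref 0.0 in
          for l = 0 to k - 1 do
            acc := !acc +. (a.(i).(l) *. b.(l).(j))
          done;
          !acc))

(* Dictionary D: n_features x n_components. *)
let d = [| [| 1.0; 0.0; 2.0 |]; [| 0.0; 1.0; 3.0 |] |]

(* Sparse codes X: n_components x n_samples, one non-zero entry per column. *)
let x = [| [| 0.0; 0.0 |]; [| 5.0; 0.0 |]; [| 0.0; 1.0 |] |]

(* Signal Y: n_features x n_samples. *)
let y = matmul d x

let () = Printf.printf "%d x %d\n" (Array.length y) (Array.length y.(0))
```

Each column of Y is a scaled copy of the single dictionary column selected by the corresponding sparse code.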
make_spd_matrix ?n_dim ?random_state ()
generates symmetric positive-definite matrix.
Creates random SPD matrix using A^T * A + epsilon * I.
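The construction A^T * A + epsilon * I can be sketched on plain float arrays. In this example `spd_from` is a hypothetical helper, and A is fixed by hand rather than drawn at random so the result is easy to verify.

```ocaml
(* Compute A^T * A + epsilon * I for a dense matrix A.
   Hypothetical helper, not the library's implementation. *)
let spd_from (a : float array array) ~(epsilon : float) : float array array =
  let rows = Array.length a and n = Array.length a.(0) in
  Array.init n (fun i ->
      Array.init n (fun j ->
          (* Start from epsilon on the diagonal, 0 elsewhere. *)
          let acc = ref (if i = j then epsilon else 0.0) in
          for k = 0 to rows - 1 do
            acc := !acc +. (a.(k).(i) *. a.(k).(j))
          done;
          !acc))

let m = spd_from [| [| 1.0; 2.0 |]; [| 3.0; 4.0 |] |] ~epsilon:0.5

let () =
  Array.iter
    (fun row ->
      Array.iter (Printf.printf "%g ") row;
      print_newline ())
    m
```

The result is symmetric by construction, and the epsilon term keeps it positive-definite even when A itself is rank-deficient.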
val make_sparse_spd_matrix :
?n_dim:int ->
?alpha:float ->
?norm_diag:bool ->
?smallest_coef:float ->
?largest_coef:float ->
?random_state:int ->
unit ->
Nx.float32_t
make_sparse_spd_matrix ?n_dim ?alpha ...
generates sparse symmetric positive-definite matrix.
Creates sparse SPD matrix with controllable sparsity.
Biclustering Generators
val make_biclusters :
?shape:(int * int) ->
?n_clusters:int ->
?noise:float ->
?minval:int ->
?maxval:int ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t * Nx.int32_t
make_biclusters ?shape ?n_clusters ...
generates constant block diagonal structure.
Creates matrix with block diagonal biclusters.
Returns (X, row_labels, col_labels) indicating bicluster membership
val make_checkerboard :
?shape:(int * int) ->
?n_clusters:(int * int) ->
?noise:float ->
?minval:int ->
?maxval:int ->
?shuffle:bool ->
?random_state:int ->
unit ->
Nx.float32_t * Nx.int32_t * Nx.int32_t
make_checkerboard ?shape ?n_clusters ...
generates checkerboard structure.
Creates matrix with checkerboard pattern of high/low values.
Returns (X, row_labels, col_labels) indicating cluster membership