
Toby Dylan Hocking

Associate Academic Member
Associate Professor, Université de Sherbrooke, Department of Computer Science
Research Topics
Medical Machine Learning
Deep Learning
Computational Biology
Data Mining
Optimization
Computer Vision

Biography

Originally from California and educated at Berkeley, Toby Dylan Hocking earned his PhD in mathematics (machine learning) from the École normale supérieure de Cachan (Paris, France) in 2012. He worked as a postdoc in Masashi Sugiyama's machine learning lab at Tokyo Tech in 2013, and in Guillaume Bourque's genomics lab at McGill University.

He was a tenure-track assistant professor at Northern Arizona University for five years, and today he is a tenured associate professor at the Université de Sherbrooke, where he leads the LASSO (Learning Algorithms, Statistical Software, Optimization) research lab. Toby is also an Associate Academic Member of Mila - Quebec Artificial Intelligence Institute.

He is the author of dozens of R packages and has published more than 50 peer-reviewed research papers on machine learning and statistical software. He has mentored more than 30 students in research projects, as well as more than 30 open-source software contributors through the R project in Google Summer of Code.

Publications

Cross-validation for training and testing co-occurrence network inference algorithms
Daniel Agyapong
Jeffrey Ryan Propster
Jane Marks
Interval Regression: A Comparative Study with Proposed Models
Tung L. Nguyen
Regression models are essential for a wide range of real-world applications. However, in practice, target values are not always precisely known; instead, they may be represented as intervals of acceptable values. This challenge has led to the development of Interval Regression models. In this study, we provide a comprehensive review of existing Interval Regression models and introduce alternative models for comparative analysis. Experiments are conducted on both real-world and synthetic datasets to offer a broad perspective on model performance. The results demonstrate that no single model is universally optimal, highlighting the importance of selecting the most suitable model for each specific scenario.
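To make the interval-target setting concrete, here is a minimal sketch of one classical approach: a linear model trained with a squared hinge loss that is zero whenever the prediction falls inside the target interval. This illustrates the problem setting only, not any of the specific models benchmarked in the paper; all names, data, and hyperparameters below are assumptions.

```python
import numpy as np

def interval_loss(pred, lo, hi):
    """Squared hinge loss: zero when pred lies inside [lo, hi].

    lo may be -inf and hi may be +inf, which covers censored targets."""
    return np.maximum(0, lo - pred) ** 2 + np.maximum(0, pred - hi) ** 2

def fit_linear_interval_regression(X, lo, hi, lr=0.01, epochs=1000):
    """Illustrative linear model fit by gradient descent on the mean loss."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        pred = X @ w + b
        # gradient of the squared hinge loss with respect to pred
        grad = 2 * (np.maximum(0, pred - hi) - np.maximum(0, lo - pred))
        w -= lr * (X.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

# toy data: each target is known only up to an interval of acceptable values
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
lo, hi = y - 0.5, y + 0.5
w, b = fit_linear_interval_regression(X, lo, hi)
```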
Efficient line search for optimizing Area Under the ROC Curve in gradient descent
Jadon Fowler
Receiver Operating Characteristic (ROC) curves are useful for evaluation in binary classification and changepoint detection, but difficult to use for learning since the Area Under the Curve (AUC) is piecewise constant (gradient zero almost everywhere). Recently the Area Under Min (AUM) of false positive and false negative rates has been proposed as a differentiable surrogate for AUC. In this paper we study the piecewise linear/constant nature of the AUM/AUC, and propose new efficient path-following algorithms for choosing the learning rate which is optimal for each step of gradient descent (line search), when optimizing a linear model. Remarkably, our proposed line search algorithm has the same log-linear asymptotic time complexity as gradient descent with constant step size, but it computes a complete representation of the AUM/AUC as a function of step size. In our empirical study of binary classification problems, we verify that our proposed algorithm is fast and exact; in changepoint detection problems we show that the proposed algorithm is just as accurate as grid search, but faster.
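For intuition, the AUM quantity itself can be computed exactly in one pass over sorted scores, since the false positive and false negative rates are step functions of the decision threshold. The sketch below is illustrative only: it computes the surrogate, not the paper's path-following line-search algorithm, and it assumes both classes are present in the labels.

```python
import numpy as np

def area_under_min(scores, labels):
    """AUM: integral over thresholds c of min(FPR(c), FNR(c)).

    labels in {0, 1}; an example is predicted positive when score > c.
    FPR/FNR only change at the observed scores, so the integral
    reduces to a finite sum over consecutive sorted scores."""
    order = np.argsort(scores)
    s, y = scores[order], labels[order]
    n_pos, n_neg = y.sum(), (1 - y).sum()
    aum = 0.0
    fp = n_neg   # threshold below all scores: every negative is a FP
    fn = 0       # and no positive is missed
    for i in range(len(s) - 1):
        # moving the threshold past s[i] flips example i to "negative"
        if y[i] == 1:
            fn += 1
        else:
            fp -= 1
        aum += min(fp / n_neg, fn / n_pos) * (s[i + 1] - s[i])
    return aum

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=20)
scores = rng.normal(size=20) + labels   # informative toy scores
print(area_under_min(scores, labels))
```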
SOAK: Same/Other/All K-fold cross-validation for estimating similarity of patterns in data subsets
Gabrielle Thibault
C. S. Bodine
Paul Nelson Arellano
Alexander F Shenkin
Olivia J. Lindly
In many real-world applications of machine learning, we are interested to know if it is possible to train on the data that we have gathered so far, and obtain accurate predictions on a new test data subset that is qualitatively different in some respect (time period, geographic region, etc). Another question is whether data subsets are similar enough so that it is beneficial to combine subsets during model training. We propose SOAK, Same/Other/All K-fold cross-validation, a new method which can be used to answer both questions. SOAK systematically compares models which are trained on different subsets of data, and then used for prediction on a fixed test subset, to estimate the similarity of learnable/predictable patterns in data subsets. We show results of using SOAK on six new real data sets (with geographic/temporal subsets, to check if predictions are accurate on new subsets), 3 image pair data sets (subsets are different image types, to check that we get smaller prediction error on similar images), and 11 benchmark data sets with predefined train/test splits (to check similarity of predefined splits).
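The comparison SOAK performs can be sketched as a loop over test subsets and folds, training on Same/Other/All rows and scoring them on a common test fold. The simplified version below uses a scikit-learn classifier as a stand-in for the models in the paper; the function name, toy data, and per-subset fold assignment are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def soak_cv(X, y, subset, n_splits=3, seed=0):
    """For each test subset and fold, train on Same / Other / All
    training rows, then score all three models on the same test rows."""
    results = []
    for test_subset in np.unique(subset):
        idx = np.where(subset == test_subset)[0]
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for fold, (_, test_fold) in enumerate(kf.split(idx)):
            is_test = np.zeros(len(y), dtype=bool)
            is_test[idx[test_fold]] = True
            train_sets = {
                "same":  ~is_test & (subset == test_subset),
                "other": ~is_test & (subset != test_subset),
                "all":   ~is_test,
            }
            for name, train_mask in train_sets.items():
                model = LogisticRegression(max_iter=1000)
                model.fit(X[train_mask], y[train_mask])
                acc = accuracy_score(y[is_test], model.predict(X[is_test]))
                results.append((test_subset, fold, name, acc))
    return results

# toy example: two geographic subsets with related but different patterns
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
subset = np.repeat(["north", "south"], 60)
y = (X[:, 0] + (subset == "south") * X[:, 1] > 0).astype(int)
for row in soak_cv(X, y, subset):
    print(row)
```

Comparing the "same" and "other" accuracies for each test subset indicates whether the subsets share learnable patterns; "all" indicates whether combining them during training helps.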
Finite Sample Complexity Analysis of Binary Segmentation
Binary segmentation is the classic greedy algorithm which recursively splits a sequential data set by optimizing some loss or likelihood function. Binary segmentation is widely used for changepoint detection in data sets measured over space or time, and as a sub-routine for decision tree learning. In theory it should be extremely fast for…
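As a reference point, a plain square-loss version of the algorithm fits in a few lines: each candidate segment stores its best split, and the split giving the largest loss decrease is taken greedily. This sketch is for illustration only and does not reflect the paper's complexity analysis.

```python
import numpy as np

def best_split(x):
    """Best single changepoint in segment x under the square loss.

    Returns (loss_decrease, split), where the second segment starts
    at index split. Cumulative sums evaluate all splits in O(n)."""
    n = len(x)
    csum = np.cumsum(x)
    total = csum[-1]
    before = np.arange(1, n)          # first-segment sizes 1..n-1
    after = n - before
    # square loss up to a constant: -(segment sum)^2 / (segment size)
    cost_split = -csum[:-1] ** 2 / before - (total - csum[:-1]) ** 2 / after
    cost_none = -total ** 2 / n
    i = int(np.argmin(cost_split))
    return cost_none - cost_split[i], i + 1

def binary_segmentation(x, n_changes):
    """Greedy binary segmentation: repeatedly split the segment whose
    best split most decreases the total square loss."""
    changepoints = []
    segments = {(0, len(x)): best_split(x)}   # (start, end) half-open
    for _ in range(n_changes):
        if not segments:
            break
        (s, e), (gain, split) = max(segments.items(), key=lambda kv: kv[1][0])
        del segments[(s, e)]
        changepoints.append(s + split)
        for a, b in [(s, s + split), (s + split, e)]:
            if b - a >= 2:                    # need 2 points to split again
                segments[(a, b)] = best_split(x[a:b])
    return sorted(changepoints)

# toy signal with two changes, at 50 and 100
x = np.concatenate([np.zeros(50), np.full(50, 3.0), np.zeros(50)])
print(binary_segmentation(x, 2))              # -> [50, 100]
```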
Enhancing Changepoint Detection: Penalty Learning through Deep Learning Techniques
Tung L. Nguyen
Changepoint detection, a technique for identifying significant shifts within data sequences, is crucial in various fields such as finance, genomics, medicine, etc. Dynamic programming changepoint detection algorithms are employed to identify the locations of changepoints within a sequence, which rely on a penalty parameter to regulate the number of changepoints. To estimate this penalty parameter, previous work uses simple models such as linear or tree-based models. This study introduces a novel deep learning method for predicting penalty parameters, leading to demonstrably improved changepoint detection accuracy on large benchmark supervised labeled datasets compared to previous methods.
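The penalty-learning setup can be illustrated with a small neural network that maps sequence features to a predicted log(penalty), trained with an interval hinge loss on the range of penalties that yield zero label errors. The architecture and toy supervision below are assumptions for illustration, not the network from the paper.

```python
import torch

# Each sequence is summarized by a feature vector; supervision gives an
# interval [lo, hi] of log(penalty) values that produce zero label errors
# (computed offline with a dynamic-programming changepoint solver).
# Illustrative architecture, not the one from the paper.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

def interval_hinge_loss(pred, lo, hi):
    """Zero when the predicted log(penalty) lies inside [lo, hi]."""
    return (torch.relu(lo - pred) ** 2 + torch.relu(pred - hi) ** 2).mean()

# toy supervision: 200 sequences, 4 features, unit-width target intervals
features = torch.randn(200, 4)
lo = features.sum(dim=1, keepdim=True)
hi = lo + 1.0

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = interval_hinge_loss(model(features), lo, hi)
    loss.backward()
    opt.step()
```

At prediction time, the network's output is exponentiated to obtain the penalty passed to the changepoint solver for a new sequence.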
Automated River Substrate Mapping From Sonar Imagery With Machine Learning
C. S. Bodine
D. Buscombe
Reply to: Model uncertainty obscures major driver of soil carbon
Feng Tao
Benjamin Z. Houlton
Serita D. Frey
Johannes Lehmann
Stefano Manzoni
Yuanyuan Huang
Lifen Jiang
Umakant Mishra
Bruce A. Hungate
Michael W. I. Schmidt
Markus Reichstein
Nuno Carvalhais
Philippe Ciais
Ying-Ping Wang
Bernhard Ahrens
Gustaf Hugelius
Xingjie Lu
Zheng Shi
Kostiantyn Viatkin
Ronald Vargas
Yusuf Yigini
Christian Omuto
Ashish A. Malik
Guillermo Peralta
Rosa Cuevas-Corona
Luciano E. Di Paolo
Isabel Luotto
Cuijuan Liao
Yi-Shuang Liang
Yixin Liang
Vinisa S. Saynes
Xiaomeng Huang
Yiqi Luo