<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Data Mining | Hasan Kurban, Ph.D.</title>
    <link>https://www.hasankurban.com/tag/data-mining/</link>
      <atom:link href="https://www.hasankurban.com/tag/data-mining/index.xml" rel="self" type="application/rss+xml" />
    <description>Data Mining</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><copyright>© Hasan Kurban, 2026</copyright><lastBuildDate>Tue, 20 Jun 2023 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://www.hasankurban.com/images/icon_hu20e5e5b4446f9fea2c18d55ef2678e39_24096_512x512_fill_lanczos_center_2.png</url>
      <title>Data Mining</title>
      <link>https://www.hasankurban.com/tag/data-mining/</link>
    </image>
    
    <item>
      <title>More Real Fantasy Leagues</title>
      <link>https://www.hasankurban.com/project/ecml1/</link>
      <pubDate>Tue, 20 Jun 2023 00:00:00 +0000</pubDate>
      <guid>https://www.hasankurban.com/project/ecml1/</guid>
<description>&lt;p&gt;Fantasy sports has a current market size of $23B and is expected to grow to more than $84B in less than a decade. The intent is to create virtual teams that somehow reflect what would happen if the players actually played. Using individual player stats, models predict an outcome, but fans are left wanting more. To achieve a more realistic outcome (something beyond raw statistics), aspects of what makes live teams win need to be included: (1) transforming stats to reflect their relative importance with respect to a position; (2) team chemistry (TC). In this work, we present a novel characterization of relative position statistics and a new description of TC. Using data drawn from the NBA&amp;rsquo;s API, we form a data set to determine whether a fantasy team makes the playoffs using over two dozen features, including TC. Our system has over 100 models competing to be the best predictor, with compelling results. To let users work with this more realistic fantasy team model, we offer a web service (continually updated from the API) that not only inspects virtual teams and TC but also simulates match-ups with existing 2023 NBA teams, with visualizations that help improve team creation. Our web service can be accessed at &lt;a href=&#34;https://dalkilic.luddy.indiana.edu/fantasyleague/&#34;&gt;https://dalkilic.luddy.indiana.edu/fantasyleague/&lt;/a&gt;, and the source code can be found at &lt;a href=&#34;https://github.com/gany-15/nbafan&#34;&gt;https://github.com/gany-15/nbafan&lt;/a&gt;.&lt;/p&gt;
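A minimal sketch of the "many models competing" idea, assuming a standard cross-validated comparison; the model names, feature count, and synthetic data below are illustrative stand-ins, not the paper's actual pipeline or NBA data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data for the real feature set (player stats plus a
# team-chemistry feature drawn from the NBA API in the actual system).
X, y = make_classification(n_samples=300, n_features=25, random_state=0)

# Two of the "competing" models; the real system pits over 100 against
# each other and keeps the best cross-validated predictor.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
```

In practice the feature matrix would be the playoff-qualification data set described above, refreshed continually from the API.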
</description>
    </item>
    
    <item>
      <title>Geometric-k-means</title>
      <link>https://www.hasankurban.com/project/geometric-k-means/</link>
      <pubDate>Sun, 11 Jun 2023 00:00:00 +0000</pubDate>
      <guid>https://www.hasankurban.com/project/geometric-k-means/</guid>
<description>&lt;p&gt;K-means is among the most widely used ML algorithms, even though it is more than 50 years old. Recent optimizations focus on reducing distance calculations (DC). Two approaches dominate: bounded and unbounded. Bounded approaches place a priori limits on the number of DC reductions, while unbounded approaches do not. A popular unbounded improvement, Ball-k-means, determines what data can be safely ignored in a subsequent iteration. In this work, we describe a second, novel unbounded DC reduction that leverages geometry, specifically scalar projection, to reduce even more DC. This approach is linear w.r.t. the number of a centroid&amp;rsquo;s members, unlike Ball-k-means, which requires sorting. Experiments on real-world and synthetic data demonstrate that the geometric approach, which we call Geometric-k-means (or Geo-k-means for short), is significantly better. Additionally, replacing multiple instances of k-means with Geo-k-means and Ball-k-means in a state-of-the-art (SOTA) pipeline for high-dimensional computational biology data demonstrates that our approach is superior in a real-world application. Geo-k-means relies on the notion of data expressiveness, where highly expressive data significantly affects the objective function, while low expressive data does not. By separating high/low expressive data in each iteration, we effectively skip a significant number of DC, which substantially reduces run-time while preserving accuracy.&lt;/p&gt;
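The scalar-projection idea can be sketched roughly as follows. This is an illustrative reading of the abstract, not the published Geo-k-means algorithm: the projection direction and the cutoff are assumptions made for the sketch.

```python
import numpy as np

def scalar_projection(x, c, direction):
    """Length of the component of (x - c) along a unit direction."""
    unit = direction / np.linalg.norm(direction)
    return np.dot(x - c, unit)

# Toy setup: two centroids and points sampled around the first one.
rng = np.random.default_rng(0)
c1 = np.array([0.0, 0.0])
c2 = np.array([10.0, 0.0])
points = rng.normal(loc=c1, scale=1.0, size=(100, 2))

# Project each point onto the line joining the two centroids. Points
# projecting far toward c2 might plausibly change membership, so only
# they need distance calculations next iteration; the rest (the low
# expressive data) are skipped. The cutoff here is a hypothetical
# midpoint along the c1-c2 line.
line = c2 - c1
projs = np.array([scalar_projection(p, c1, line) for p in points])
cutoff = 5.0
needs_check = points[np.greater(projs, cutoff)]
```

The appeal of a projection test is that it is a single dot product per point, linear in the centroid's membership, with no sorting step.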
</description>
    </item>
    
    <item>
      <title>Sports Awards</title>
      <link>https://www.hasankurban.com/project/ecml2/</link>
      <pubDate>Sat, 10 Jun 2023 00:00:00 +0000</pubDate>
      <guid>https://www.hasankurban.com/project/ecml2/</guid>
<description>&lt;p&gt;Sports awards have become almost as popular as the sports themselves, bringing not only recognition but also salary increases, more control over decisions usually in the hands of coaches and general managers, and other benefits. Awards are so popular that even at the start of a season, pundits and amateurs alike predict or argue for particular athletes. It is odd that for something so apparently data-driven, work has not been done to determine whether it is, indeed, data-driven. The simple question arises: &#34;Are sports awards about sports?&#34; Using ML (over a hundred candidate models), this work aims to answer this question for four professional basketball awards: Most Valuable Player, Most Improved Player, Rookie of the Year, and Defensive Player of the Year. Pertinent data is gathered, including voting percentages. Our results are very interesting: MVP can be predicted well from the data, while the other three are more difficult. The findings suggest that either the data is insufficient (although no more sports data can be found) or, more likely, intangible factors play equally critical roles. This outcome is worth reflecting on for fans of all stripes: should sports awards be about sports? The source code can be found at &lt;a href=&#34;https://github.com/Nebbocaj/NBA_Awards&#34;&gt;https://github.com/Nebbocaj/NBA_Awards&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Data-centric AI</title>
      <link>https://www.hasankurban.com/project/data-expressniveness/</link>
      <pubDate>Mon, 01 Nov 2021 00:00:00 +0000</pubDate>
      <guid>https://www.hasankurban.com/project/data-expressniveness/</guid>
<description>&lt;p&gt;To deal with the unimaginable continual growth of data and the focus on its use rather than its governance, the value of data has begun to deteriorate, as seen in lapses of reproducibility, validity, provenance, etc. In this work, we aim to understand what the value of data is and how this basic understanding might affect existing AI algorithms, in particular EM-T (traditional expectation maximization), used in soft clustering, and EM* (a data-centric extension of EM-T). We have discovered that the value of data, or its &#34;expressiveness&#34; as we call it, is procedurally determined and runs the gamut from low expressiveness (LE) to high expressiveness (HE), the former not affecting the objective function much, the latter a great deal. By using balanced binary search trees (BSTs) (total orders), introduced here, we improve on our earlier work that utilized heaps (partial orders) to separate LE from HE data. EM-DC (expectation maximization-data centric) significantly improves the performance of EM-T on big data. EM-DC is an improvement over EM* in that it allows more efficient identification of LE/HE data and its placement in the BST. Outcomes of this work, aside from a significant reduction in run-time over EM* while maintaining EM-T accuracy, include the ability to isolate noisy data, convergence on data structures (using Hamming distance) rather than real values, and the ability for the user to dictate the relative mixture of LE/HE acceptable for the run. The Python code and links to the data sets are provided in the paper. We have additionally released an R version (&lt;a href=&#34;https://cran.r-project.org/web/packages/DCEM/index.html&#34;&gt;https://cran.r-project.org/web/packages/DCEM/index.html&lt;/a&gt;) that includes EM-T, EM*, and k++ initialization.&lt;/p&gt;
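A minimal sketch of the LE/HE separation over a totally ordered structure; a sorted list kept with `bisect` stands in for the balanced BST, and the contribution values and split fraction are invented for illustration.

```python
import bisect

# Keep per-point contributions to the objective in a totally ordered
# structure (a sorted list standing in for a balanced BST), then split
# it at a user-chosen fraction into low-expressive (LE) and
# high-expressive (HE) data.
contributions = []  # sorted (delta_objective, point_id) pairs
for pid, delta in enumerate([0.01, 2.5, 0.02, 1.7, 0.005, 3.1]):
    bisect.insort(contributions, (delta, pid))

frac_le = 0.5  # user-dictated LE/HE mixture (illustrative)
cut = int(len(contributions) * frac_le)
low_expressive = contributions[:cut]    # barely move the objective: skip
high_expressive = contributions[cut:]   # drive the objective: re-estimate
```

Because the structure is totally ordered, the LE/HE boundary is a single split point, which is the efficiency the abstract attributes to moving from heaps (partial orders) to BSTs.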
</description>
    </item>
    
    <item>
      <title>R Package</title>
      <link>https://www.hasankurban.com/project/package-project/</link>
      <pubDate>Mon, 20 Apr 2020 00:00:00 +0000</pubDate>
      <guid>https://www.hasankurban.com/project/package-project/</guid>
<description>&lt;p&gt;In this paper we introduce the DCEM package for clustering big data. DCEM is developed in R and is publicly accessible on the Comprehensive R Archive Network (CRAN) at &lt;a href=&#34;https://CRAN.R-project.org/&#34;&gt;https://CRAN.R-project.org/&lt;/a&gt;. In particular, we demonstrate that significant improvements in run time and number of iterations are achieved by embedding a heap structure within the framework of the traditional expectation maximization algorithm of Dempster et al. (1977). Our package makes use of the data-driven approach proposed by Kurban, Jenne, and Dalkilic (2016b) that (1) avoids visiting all data and (2) avoids continually re-visiting the data, to speed up convergence. We illustrate the practical utility of DCEM through several experiments across application domains that highlight the significant improvement in performance. In the future, we would like to (1) extend this approach to data structures that enforce a total ordering, for example the kd-tree; a min-heap is only partially ordered and hence does not enforce a strict ordering on the data, and we argue that a totally ordered structure can further improve the performance of iterative learning algorithms; and (2) apply this approach to contemporary iterative data mining algorithms, e.g., k-means; an example of such work is discussed in Kurban and Dalkilic (2017).&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Iterative Machine Learning</title>
      <link>https://www.hasankurban.com/project/kmeans-project/</link>
      <pubDate>Mon, 11 Dec 2017 00:00:00 +0000</pubDate>
      <guid>https://www.hasankurban.com/project/kmeans-project/</guid>
<description>&lt;p&gt;In this work, we have described an optimization approach that can be applied to any iterative optimization algorithm to improve its training run-time complexity. To the best of our knowledge, this is the first work that theoretically shows convergence of iterative algorithms over a heap structure instead of over a cost function. The approach is tested on k-means (KM) and the expectation-maximization algorithm (EM-T). The experimental results show dramatic improvements in KM and EM-T training run-time across different kinds of testing: scale, dimension, and separability. Regarding cluster error, the traditional algorithms’ and our extended algorithms’ performances are similar. For future work, one obvious step is to add seeding to KM*, drawing from both k-means++ and kd-trees. Additionally, we are interested in the broader question of applying this approach to iterative converging algorithms. Further, are there better structures than heaps? Lastly, parallelization offers some new challenges, but also opportunities.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Clustering Big Data</title>
      <link>https://www.hasankurban.com/project/em-project/</link>
      <pubDate>Fri, 01 Sep 2017 00:00:00 +0000</pubDate>
      <guid>https://www.hasankurban.com/project/em-project/</guid>
<description>&lt;p&gt;Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms like the Expectation Maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a non-linear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow each iteration toward the data that is more difficult to cluster. We call this extended EM-T algorithm EM*. We show that EM* outperforms EM-T over large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.&lt;/p&gt;
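A toy sketch of the heap idea, assuming points are ordered by how much their cluster responsibilities changed last iteration; the change scores and the revisit budget are invented for illustration, while EM* itself derives its ordering from the clustering process.

```python
import heapq

# Per-point change scores from the previous iteration (illustrative).
changes = [0.9, 0.01, 0.4, 0.003, 0.7, 0.02]

# Max-heap via negated keys: the most volatile points sit on top.
heap = [(-c, i) for i, c in enumerate(changes)]
heapq.heapify(heap)

# Revisit only the most volatile points; the rest keep their current
# cluster assignment this iteration, avoiding their distance work.
revisit = []
budget = 3
for _ in range(budget):
    neg_c, i = heapq.heappop(heap)
    revisit.append(i)
# revisit is [0, 4, 2]: the points whose memberships are still in flux
```

The heap makes "which data must be revisited" a cheap top-of-heap query instead of a full pass over the data, which is the separation the abstract describes.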
</description>
    </item>
    
    <item>
      <title>Studying the Milky Way</title>
      <link>https://www.hasankurban.com/project/astronomy-project/</link>
      <pubDate>Mon, 15 Sep 2014 00:00:00 +0000</pubDate>
      <guid>https://www.hasankurban.com/project/astronomy-project/</guid>
<description>&lt;p&gt;Dramatic increases in the amount and complexity of stellar data must be matched by new or refined algorithms that help scientists make sense of this data and so better understand the universe. ParaHeap-k is a parallel clustering algorithm for analyzing big data that can potentially prove useful to astronomical research.&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
