GIF art created by David Szakaly (Davidope)

Muriz Serifovic

Hi! 👋 I'm a Machine Learning Engineer at Paymira, where I train neural networks and coax low-rank updates into something useful. Besides work, my research interests include sparse autoencoders for model interpretability and the mathematical foundations of learning theory (both in vivo and in silico). Previously: ETH Zürich, Hesiod, ascarix, and Zühlke.

A question I keep gravitating toward is why neural networks, fed by an insatiable appetite for compute, settle on some representations rather than others: which features emerge first, and which "solution" gets picked from the many that fit the data equally well. Quine had this complaint long before anyone built a transformer:

Physical theories can be at odds with each other and yet compatible with all possible data even in the broadest sense.

Recent Projects

Hesiod

My ha-index is well above 80 (which is considered excellent).


GitHub

research work.

Nov. 2021

Topological Analysis of Alzheimer's Disease Progression via Persistent Homology

We give a tiny introduction to persistent homology, a tool from algebraic topology, and apply it to analyze spatiotemporal patterns in neurodegenerative diseases. Brain connectivity is modeled as a simplicial complex, where neurons are vertices and synaptic connections form higher-dimensional simplices. As the disease progresses, we track topological features (connected components, loops, voids) across a filtration parameter $\epsilon$: $$ \beta_k(\epsilon) = \dim H_k(K_\epsilon), $$ where $\beta_k$ is the $k$-th Betti number, which counts the $k$-dimensional holes of the complex $K_\epsilon$.
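To make $\beta_0(\epsilon)$ concrete, here is a minimal sketch that counts connected components of a toy point cloud as $\epsilon$ grows. It assumes a Vietoris-Rips-style construction in which two vertices are linked whenever their distance is at most $\epsilon$; the coordinates and the helper name `betti_0` are made up for illustration and are not taken from the paper. Only $\beta_0$ is computed, since it depends on the edges alone; loops and voids need a proper homology library.

```python
from itertools import combinations
from math import dist


def betti_0(points, epsilon):
    """Count connected components of the graph linking points within epsilon."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= epsilon:
            parent[find(i)] = find(j)  # merge the two components

    return len({find(i) for i in range(len(points))})


# Toy point cloud (hypothetical coordinates): two well-separated clusters.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
curve = [(eps, betti_0(points, eps)) for eps in (0.5, 1.0, 2.0, 10.0)]
print(curve)  # [(0.5, 5), (1.0, 2), (2.0, 2), (10.0, 1)]
```

The plateau at two components between $\epsilon = 1$ and $\epsilon = 10$ is the kind of long-lived feature that persistence is meant to pick out.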

Neuroscience

Mathematics

Medicine

Jun. 2021

Deep Learning Methods for Diabetic Retinopathy Classification in Fundus Images

A tiny literature review of deep learning approaches for classifying diabetic retinopathy (DR) severity from 2D fundus images. DR is usually graded on a 5-point scale, making this a standard multi-class classification problem where models minimize cross-entropy: $$ \mathcal{L} = -\sum_{c=0}^{4} y_c \log(\hat{p}_c). $$ We survey architectures (ResNet, EfficientNet, U-Net variants, etc.) and "tricks" for vessel segmentation, lesion detection, and severity grading as reported in the medical imaging literature.
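As a small worked instance of the loss above, the following sketch evaluates the cross-entropy for a single image, assuming $y$ is a one-hot label over the five grades and $\hat{p}$ is the predicted probability vector. The numbers are invented for illustration.

```python
import math


def cross_entropy(y_onehot, p_hat, eps=1e-12):
    """L = -sum_c y_c * log(p_hat_c) for one sample; eps guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_onehot, p_hat))


# Hypothetical example: the true grade is 2 and the model puts 0.60 of its mass there.
y = [0, 0, 1, 0, 0]
p = [0.05, 0.15, 0.60, 0.15, 0.05]
loss = cross_entropy(y, p)
print(round(loss, 4))  # 0.5108, i.e. -log(0.6)
```

Because $y$ is one-hot, only the probability assigned to the true grade contributes to the loss.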

Computer Vision

Medicine

Apr. 2020

A Categorical Framework for Shannon's Information Theory

What if we reformulate Claude Shannon's seminal "A Mathematical Theory of Communication" (1948) in the language of category theory (for fun, of course)? Entropy emerges naturally as a functor $H: \mathbf{FinProb} \to \mathbb{R}^+$ from the category of finite probability spaces to non-negative reals. The chain rule for entropy corresponds to a natural transformation, and conditional entropy arises from a fibered structure: $$ H(X,Y) = H(X) + H(Y \mid X). $$ This hints that many "information-theoretic identities" might be instances of "universal properties".
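The chain rule itself is easy to check numerically. The sketch below does so for a made-up joint distribution over two binary variables; it only verifies the identity and does not touch the categorical structure.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical joint distribution P(X, Y) over two binary variables.
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

# Marginal P(X), obtained by summing out Y.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

# H(Y | X) = sum_x P(x) * H(Y | X = x).
h_y_given_x = 0.0
for x, px in p_x.items():
    conditional = {y: p / px for (xx, y), p in joint.items() if xx == x}
    h_y_given_x += px * entropy(conditional)

h_xy = entropy(joint)
h_x = entropy(p_x)
print(h_xy, h_x + h_y_given_x)  # both sides agree up to floating-point error
```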

Jan. 2019

Reconstructing Sentences in YouTube Subtitles with Convolutional Neural Networks

Auto-generated YouTube subtitles (as of 2019) completely lack punctuation, producing lowercase word streams like "the cat sat on the mat it was warm". We train a simple 1D CNN that scans text with a 30-word sliding window and predicts whether a punctuation mark should be inserted at each position. For each center "token", the model outputs a probability distribution: $$ P(y_i = c \mid x_{i-15:i+15}) = \text{softmax}(f_\theta(x_{i-15:i+15}))_c $$ where $c \in \{\text{none}, \text{period}, \text{comma}, \text{question}\}$. We conduct some simple ablation studies on window size and embedding dimensions, showing the model reliably learns sentence boundaries from context alone.
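The windowing and the softmax readout can be sketched without any deep learning framework. The snippet below assumes the subscript $x_{i-15:i+15}$ is a half-open slice (so the window holds 30 tokens), pads the edges with a hypothetical `<pad>` token, and uses placeholder logits where a trained $f_\theta$ would go. None of the names come from the original code.

```python
import math

CLASSES = ["none", "period", "comma", "question"]
PAD = "<pad>"  # hypothetical padding token for positions outside the text


def windows(tokens, half=15):
    """Yield the context tokens[i-half : i+half] for every position i, padded at the edges."""
    padded = [PAD] * half + tokens + [PAD] * half
    for i in range(len(tokens)):
        yield padded[i : i + 2 * half]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


tokens = "the cat sat on the mat it was warm".split()
ctx = list(windows(tokens, half=2))  # tiny window so the output is readable
print(ctx[0])  # ['<pad>', '<pad>', 'the', 'cat']

# Placeholder logits standing in for f_theta(window); a trained CNN would produce these.
probs = softmax([0.1, 2.0, 0.3, -1.0])
prediction = CLASSES[probs.index(max(probs))]
print(prediction)  # period
```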

Natural Language Processing

Draft

Dec. 2018

Extracting Key Passages from YouTube Talks via Graph Centrality

Importance scores over time — Predicted importances (red) over time, smoothed by Weierstrass transform.

Given speech-to-text subtitles of YouTube talks, we identify the most "important" passages (who has time to watch a 3h talk?) using a graph-based approach inspired by PageRank, without any labeled data. Words are simply represented as GloVe embeddings, and connected based on semantic similarity. Sentence importance is then computed by aggregating word "centrality scores": $$ S(v_i) = (1-d) + d \sum_{v_j \in \mathcal{N}(v_i)} \frac{w_{ji}}{\sum_{v_k} w_{jk}} S(v_j) $$ where $d$ is a damping factor and $w_{ji}$ measures semantic relatedness between words.
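The score update above is a fixed-point iteration, which the following sketch runs on a tiny hand-made graph. The similarity weights, the damping value $d = 0.85$, and the iteration count are assumptions for illustration; in the project the weights would come from GloVe similarities.

```python
def text_rank(weights, d=0.85, iterations=50):
    """Iterate S(v_i) = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * S(v_j)."""
    nodes = list(weights)
    out_sum = {j: sum(weights[j].values()) for j in nodes}
    scores = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        scores = {
            i: (1 - d) + d * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in nodes
                if i in weights[j] and out_sum[j] > 0
            )
            for i in nodes
        }
    return scores


# Hypothetical symmetric similarity graph: weights[j][i] holds w_ji.
weights = {
    "network": {"neural": 0.9, "learning": 0.6},
    "neural": {"network": 0.9, "learning": 0.7},
    "learning": {"network": 0.6, "neural": 0.7, "banana": 0.1},
    "banana": {"learning": 0.1},
}
scores = text_rank(weights)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[-1])  # banana: weakly connected words end up with the lowest centrality
```

A sentence score could then be formed by, for example, averaging the scores of its words; the exact aggregation used in the project is not specified here.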

Natural Language Processing

GitHub

Dec. 2017

Food Image Recognition for Recipe Retrieval

DeepChef dual-CNN architecture — Dual-CNN architecture: InceptionV3 and VGG-16 predictions fused via weighted scoring.

We propose a dual-CNN ensemble approach that combines semantic and visual similarity for food image retrieval. Given a query image $x$, two networks process it in parallel: InceptionV3 (fine-tuned on 230 food categories) produces category probabilities, while VGG-16 (pretrained on ImageNet) extracts a 4096-dimensional feature vector $\phi(x) \in \mathbb{R}^{4096}$ from its fc2 layer: $$ P(c \mid x) = \text{softmax}(W_{\text{Inc}} \cdot f_{\text{Inc}}(x))_c. $$ The VGG features are projected into PCA space $\phi_{\text{PCA}}(x) = T(\phi(x)) \in \mathbb{R}^{d'}$, where NMSLIB's HNSW algorithm performs approximate nearest neighbor search to retrieve $K=20$ visually similar recipe images. Our weighting_neural_net_inputs() function then "merges" both signals: for each ANN candidate $i$ with predicted category $c_i$, we compute a weighted score by matching it against InceptionV3's top-$K$ predictions: $$ s_i = \begin{cases} P(c_i \mid x) & \text{if } c_i \in \{\text{top-}K \text{ Inception categories}\} \\ 0 & \text{otherwise.} \end{cases} $$ Final results are sorted by $s_i$ in descending order. This weighted combination allows visually similar candidates (from ANN) to be re-ranked by semantic confidence (from InceptionV3), outperforming either network in isolation.
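The re-ranking rule for $s_i$ can be sketched in a few lines. This is an illustrative stand-in, not the actual weighting_neural_net_inputs() implementation: the function name `rerank`, the data layout, and the example probabilities are all assumptions.

```python
def rerank(ann_candidates, category_probs, top_k=5):
    """Score each ANN candidate by the classifier probability of its category.

    ann_candidates: list of (recipe_id, category) pairs from the visual search.
    category_probs: dict mapping category -> P(c | x) from the classifier.
    Candidates whose category is outside the classifier's top_k get score 0.
    """
    top = set(sorted(category_probs, key=category_probs.get, reverse=True)[:top_k])
    scored = [
        (recipe_id, category_probs[cat] if cat in top else 0.0)
        for recipe_id, cat in ann_candidates
    ]
    # Python's sort is stable, so tied candidates keep their ANN order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical classifier output and nearest-neighbor candidates.
probs = {"pizza": 0.7, "lasagna": 0.2, "salad": 0.1}
candidates = [("r1", "salad"), ("r2", "pizza"), ("r3", "lasagna"), ("r4", "pizza")]
result = rerank(candidates, probs, top_k=2)
print(result)  # [('r2', 0.7), ('r4', 0.7), ('r3', 0.2), ('r1', 0.0)]
```

Relying on sort stability means visual similarity still breaks ties between candidates of the same category.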

Machine Learning

GitHub

Oct. 2016

Genetic Algorithm for Image Approximation with Overlaid Image Instances

Mona Lisa approximated with 250 overlaid image instances.

Will Smith approximated over 120'000 generations.

Given a target image $T$ (e.g., Mona Lisa) and a transparent source image $S$, we seek to approximate $T$ by overlaying $N=250$ transformed copies of $S$. Each candidate solution is encoded as DNA $= [g_1, g_2, \ldots, g_{250}]$ where each gene $g_i = (x, y, s, \theta, \alpha) \in \mathbb{Z}^5$ specifies position, scale, rotation, and opacity. The optimization objective is: $$ \text{DNA}^* = \arg\max_{\text{DNA} \in \mathcal{D}} f(\text{DNA}) $$ where $\mathcal{D}$ is the search space of all possible DNA configurations. Because this landscape is non-convex with exponentially many local optima, we employ a genetic algorithm with tournament selection, crossover, and mutation. The fitness function measures image similarity as the inverse of the root-mean-square error: $$ f(\text{DNA}) = \frac{1}{\sqrt{\frac{1}{P}\sum_{i=1}^{P} d_i^2}} $$ where $P$ is the number of pixels and $d_i$ is the RGB difference at pixel $i$ between $T$ and the rendered phenotype. After 120'000 generations, the algorithm achieves 93.071% similarity for the Mona Lisa.
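A minimal sketch of the two ingredients named above, the inverse-RMS fitness and tournament selection, might look as follows. It assumes images are flat lists of RGB tuples and takes the difference per channel; the tournament size and all function names are illustrative and not taken from the repository.

```python
import math
import random


def fitness(target, rendered):
    """Inverse RMS of per-channel differences between two equally sized images.

    Images are flat lists of (r, g, b) tuples; identical images give infinity.
    """
    diffs = [t - r for tp, rp in zip(target, rendered) for t, r in zip(tp, rp)]
    rms = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return math.inf if rms == 0 else 1.0 / rms


def tournament(population, scores, k=3, rng=random):
    """Draw k distinct individuals at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=scores.__getitem__)]


# Hypothetical two-pixel images.
target = [(255, 0, 0), (0, 255, 0)]
close = [(250, 0, 0), (0, 250, 0)]
far = [(0, 0, 0), (0, 0, 0)]
print(fitness(target, close) > fitness(target, far))  # True

population = ["a", "b", "c"]
scores = [0.1, 0.9, 0.5]
print(tournament(population, scores, k=3))  # b (all three compete, the fittest wins)
```

Smaller tournaments give weaker candidates a chance to reproduce, which helps keep diversity in a landscape with many local optima.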

Mathematical Optimization

GitHub

Nov. 2015

Approximating Images with a Single Continuous Thread

Portrait with string art — Portrait approximated by 6'000 line segments.

Thread path construction animation (4x speed).

Inspired by recent viral string art, we approximate any grayscale image using a single thread wound between nails arranged on a circular frame. Given $n$ nails, the ideal formulation is a constrained least-squares problem: let $A \in \mathbb{R}^{m \times \binom{n}{2}}$ encode pixel coverage for each possible line, and $y \in \mathbb{R}^m$ be the target image: $$ \min_{x \geq 0} \|Ax - y\|_2^2 \quad \text{subject to } x \text{ forms a valid path}. $$ In practice, we use a fast greedy heuristic: at each step, select the darkest line from the current pin (evaluating only $\sim 180$ candidates, not all $\binom{n}{2}$ possibilities), then move to its endpoint. This runs in $O(T \cdot n \cdot p)$ time where $T$ is the number of lines, $n$ is the number of pins, and $p$ is the number of pixels sampled per line. The continuous thread constraint ensures consecutive lines share an endpoint, producing a path that can be physically reconstructed. With e.g. 6'000 segments, images emerge from pure geometry.
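The greedy heuristic can be sketched on a tiny grid in pure Python. This is a simplified illustration under several assumptions: it scans every other pin instead of a restricted candidate set, samples a fixed number of points per line, and lightens covered pixels by a made-up `fade` factor so the same line is not chosen forever. Names and parameters are hypothetical.

```python
import math


def pin_positions(n, size):
    """Place n pins evenly on a circle inscribed in a size x size image, as (x, y)."""
    c = r = (size - 1) / 2
    return [
        (c + r * math.cos(2 * math.pi * k / n), c + r * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]


def line_pixels(p, q, samples):
    """Sample pixel coordinates (row, col) along the segment from p to q."""
    pts = []
    for s in range(samples):
        t = s / (samples - 1)
        x = p[0] + t * (q[0] - p[0])
        y = p[1] + t * (q[1] - p[1])
        pts.append((round(y), round(x)))
    return pts


def greedy_thread(darkness, n_pins=16, n_lines=10, samples=20, fade=0.5):
    """Return a pin sequence; darkness[r][c] lies in [0, 1], where 1 is black."""
    size = len(darkness)
    remaining = [row[:] for row in darkness]  # darkness still left to cover
    pins = pin_positions(n_pins, size)
    path = [0]
    for _ in range(n_lines):
        cur = path[-1]
        best, best_score = None, -1.0
        for cand in range(n_pins):
            if cand == cur:
                continue
            px = line_pixels(pins[cur], pins[cand], samples)
            score = sum(remaining[r][c] for r, c in px) / len(px)
            if score > best_score:
                best, best_score = cand, score
        for r, c in line_pixels(pins[cur], pins[best], samples):
            remaining[r][c] *= fade  # lighten what the new thread covers
        path.append(best)  # the next line starts where this one ended
    return path


# Hypothetical 21 x 21 image with a single dark horizontal stripe.
image = [[1.0 if r == 10 else 0.0 for _ in range(21)] for r in range(21)]
path = greedy_thread(image, n_pins=16, n_lines=3)
print(path[:2])  # [0, 8]: the first thread runs straight along the stripe
```

Each iteration costs one scan over the candidate pins times the samples per line, which matches the per-step cost in the complexity estimate above.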

Mathematical Optimization

GitHub