pile of thoughts

2025-10-06T20:59:03+00:00

When I first co-founded my startup, I had no formal experience in leadership. I built a team from scratch — sourcing every early hire through LinkedIn, conducting interviews myself, and making offers based on conviction and chemistry. It was a crash course in leadership, and while we achieved a lot, looking back, there’s so much I would do differently today.

Great bosses aren’t born overnight – they’re shaped by the habits they practice every day. After reflecting on both my own experience and lessons from leadership books, I’ve come to believe in five core practices that truly define great leaders when they’re applied consistently.

1. Give Clear Direction

At my startup, I shared a strong vision that attracted incredible people — some joined even when the compensation didn’t match their market value. Our mission was inspiring enough to unite us. But here’s the hard truth: I didn’t revisit that vision often enough.

A compelling vision doesn’t live in a single all-hands or pitch deck. It has to be retold, refined, and reinforced every 90 days so that people don’t just hear it but own it. Hearing it once isn’t enough — they need to understand it deeply and connect it to their own goals. I learned this too late, and it’s something I would prioritize if I were leading a team again.

2. Provide the Necessary Tools

I always made sure my team had access to resources — GPUs, papers, new frameworks, anything that could help them explore faster. What I overlooked, though, was that my time was often the most valuable resource I could give.

Resources like training, technology, and extra help matter – but a leader’s attention matters most. I used to assume casual chats or shared code reviews were enough, but structured one-on-ones might have uncovered deeper needs or bottlenecks. The simplest way to confirm that your team has what they need to do great work? Just ask.

3. Let Go

Once expectations are clear and people have what they need, leaders need to resist the urge to micromanage. Early on, I sometimes over-involved myself because the work was deep tech and difficult to measure. In research-driven environments, tangible progress is often fuzzy — you might be “stuck” for weeks, yet still be advancing.

Because I didn’t set clear expectations from the beginning, I found myself constantly clarifying, defending, or justifying the team’s progress to external stakeholders. It drained energy that could have been spent supporting the team. The best lesson I learned: focus on people who understand their roles, genuinely want the responsibility, and have the ability to deliver — then give them space to succeed.

4. Act with the Greater Good in Mind

Short-term wins can be tempting, especially under investor pressure. But I’ve learned that when you consistently choose what’s best for the team or organization — even if it’s harder in the short term — you build lasting credibility.

When I had to argue for my team’s value during tough discussions about layoffs, I realized how crucial it is for leadership actions to align with the long-term vision. Acting with integrity and transparency, even under stress, shapes how people remember your leadership long after you leave.

5. Set Aside Time to Reflect

In the early days, I was always in execution mode — hiring, coding, firefighting. I rarely paused to step back and reflect. Yet reflection is where real growth happens. Whether it’s a quiet hour each week or a full-day offsite, stepping back helps you see the bigger picture.

I now believe leaders need to schedule reflection the way they schedule standups. It’s not indulgence — it’s maintenance for clear thinking and better decisions.

The Management Habits That Sustain Leadership

Leadership sets the direction, but management keeps the ship steady. Here are five management practices that go hand in hand with leadership — ones I’ve learned both by missing and by doing.

1. Set Clear Expectations

I used to assume my team understood priorities and responsibilities intuitively. We were a small, tight-knit group, so I thought shared context was enough. It wasn’t. Without explicit expectations, accountability becomes ambiguous.

People need to know their roles, the values that guide decisions, the priorities that matter most, and the results they’re accountable for. Without clarity in these areas, accountability is impossible.

2. Communicate Well

In my first startup, communication often happened informally — over Slack threads or coffee chats. It worked surprisingly well for creativity, but not always for alignment. I learned that great communication isn’t about talking often, it’s about checking for understanding.

Don’t rely on assumptions – ask questions, listen carefully, and verify mutual understanding. Open dialogue builds trust and ensures that nothing important is left unsaid.

3. Establish a Steady Meeting Rhythm

I didn’t set regular one-on-ones officially, though I did have frequent informal discussions. Those casual chats were fantastic for motivation and brainstorming, but they lacked structure. If I could redo it, I’d add a weekly rhythm — team meetings with clear agendas and one-on-ones during the first 90 days for every new hire.

That early attention aligns expectations quickly and prevents misalignment later on.

4. Hold Quarterly Conversations

This is one of my biggest takeaways from hindsight. We never had formal quarterly off-sites, and I now realize how powerful they could have been. Meeting off-site, away from daily distractions, to discuss what’s working, what isn’t, and how each person is living up to their role, values, and priorities — that’s where deeper connection and synergy form.

Quarterly conversations keep relationships and goals from fraying, even in fast-moving startups.

5. Reward and Recognize

At my startup, we were strong on passion but weak on celebration. Feedback and recognition happened naturally but sporadically. I’ve since seen other teams do this much better — giving quick, public praise and private, constructive feedback within 24 hours.

It’s such a small thing, but consistent recognition creates momentum and trust.

Building a Culture of Trust and Accountability

Together, these ten practices don’t just create accountability – they shape a culture where trust is reinforced at every level. I’ve experienced what happens when some of these are missing — and how powerful it is when they’re present.

My journey has taken me from founder to leader, and now, back to being a follower in another startup. This transition has taught me that great leadership isn’t about authority — it’s about discipline, clarity, and care.

Judge the value of your company

2025-09-22T00:00:00+00:00

After founding and building my startup, I left the company. Because it had found product-market fit, it operates well without my day-to-day monitoring. Setting up all technical operations in an automated manner has finally paid off.

Once I stepped back, my stake became passive: common stock that is no longer tied to an active founder. That makes it appealing for investors to acquire. One reason is to buy it out while it is still relatively inexpensive. Another is to remove a potential red flag before the next financing round: for some investors, a large part of the company being held by a departed founder is far from ideal.

So what decides the value of a stock in the private market, and in an early-stage startup in particular? In traditional valuation, a multiple is applied to annual revenue or annual profit (some factor × annual revenue, or some factor × annual profit). However, annual profit is mostly negative for startups (especially VC-backed ones; otherwise there would be no strong need to raise capital), so the profit-based approach is mostly used for mature businesses and industries.

See the Appendix below for typical values of that factor.
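To make the arithmetic concrete, here is a minimal sketch of a multiple-based estimate. The revenue figure, the 10% stake, and the function name are hypothetical; the 2× to 6× range is the growth-stage SaaS row of the Appendix, used purely as an illustration.

```python
def value_range(annual_revenue: float, low_multiple: float, high_multiple: float):
    """Return (low, high) company value as multiple x annual revenue."""
    return annual_revenue * low_multiple, annual_revenue * high_multiple

# Hypothetical inputs: 1,000,000 in annual revenue and a 10% stake.
company_low, company_high = value_range(1_000_000, 2.0, 6.0)
stake = 0.10
stake_low, stake_high = company_low * stake, company_high * stake
print(f"company: {company_low:,.0f} to {company_high:,.0f}")
print(f"stake:   {stake_low:,.0f} to {stake_high:,.0f}")
```

The width of the range is the point: with the same revenue, the two ends of the multiple differ by a factor of three, which is roughly the room a buyer and a seller have to disagree in.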

I have been through multiple negotiations with investors, and it is still hard to find common ground. It is natural for the seller to ask high and the buyer to bid low. The value is up in the air, so nobody can objectively tell who is asking outside a reasonable range.

One strategy is to anchor the share price from the last round. Of course, there are factors to consider in this case.

The money goes to the seller rather than into the company, so it does not fund anything that could increase innovation within the company.
If the share class is different (common vs. preferred), the value differs.
The last round may have happened months or even years ago.
The business could have evolved—or deteriorated—over time, making that old valuation less reliable.

When expectations diverge, arguing over a single number rarely helps. What works better is structuring the deal so it moves with reality. Sometimes the price isn’t fixed at signing but adjusts later—say, when the next funding round sets a fresh valuation. Sometimes the sale happens in stages: a portion now, the rest when milestones are met. A neutral valuation can also help anchor both sides, and non-cash elements—faster liquidity, a role, introductions—can close the gap without simply raising the cash number. In the end, early-stage secondary sales are less about formulas and more about aligning incentives and trust. Clarity on risk, share rights, and the company’s trajectory does more to seal a deal than any spreadsheet multiple.

Appendix: Typical Valuation Multiples by Industry & Stage

Industry / Business Model	Company Stage / ARR / Size	Typical Revenue Multiple (EV/Revenue or ARR)	Notes & Considerations
SaaS / Subscription Software	Early‐stage, ARR < ≈ US$1-2M	~ 1× to 3×	If growth is modest, high churn, unproven retention. (Class VI Partners)
	Growth stage, ARR between US$1-10M	~ 2× to 6× (sometimes more, up to ~8× for strong metrics)	Higher growth, better metrics (low churn, good gross margins, recurring revenue) push toward upper end. (Aventis Advisors)
	Mature SaaS, ARR > US$10M & good retention/profitability	~ 5× to 10×+	More stable, less risk → higher multiple. But investors expect returns and some path to profitability. (Acquire.com Blog)
Tech / B2B (non-SaaS)	Smaller revenue ($1-5M range)	~ 2× to 3×	Includes hardware, non-recurring sales, services. Less recurring revenue → lower multiple. (First Page Sage)
	Medium size ($5-10-$75M)	~ 2.5× to 4×	Growth, scale, repeat business help. (First Page Sage)
Fintech	Private, high-growth fintechs	~ 3× to 5× revenue in many cases	If growth is very strong and risk tolerable, can go higher. Higher regulatory risk tends to suppress multiple somewhat. (First Page Sage)
Private SaaS / Bootstrapped	Smaller, bootstrapped SaaS firms	~ 4.5×-6×	Data from SaaS Capital: bootstrapped firms tend to get lower multiples than equity-backed, but still meaningful. (SaaS Capital)
Public SaaS / Big-cap	Larger, public SaaS companies	~ 5× to 15×+	When metrics (growth, margins, retention) are excellent, multiples at high end; otherwise lower. (Acquire.com Blog)

Leader To Follower

2025-09-10T00:00:00+00:00

Throughout my entrepreneurial journey, I have been in a leadership position, deciding where my team and I should head. Most of the time, I was thinking about which direction had the highest chance of impact, which projects to create for the team, and whom to promote, hire, or encourage toward better performance.

This year (2025) was a big move for me: I stepped away from the startup I co-founded and joined another startup at a similar growth stage. Both companies have 8-10 employees and around a million euros in revenue, and both are growing fast (though not hyper-fast).

Because I joined an early-stage startup, there is still a lot of flexibility and responsibility, but far less than as a leader or founder. My first thought, after only a week of working as an individual contributor, was that being in a leadership role pushes me to work more for the organization than for the boss.

The pitfall of focusing on the wrong topic is common in all types of work. Even if you solve a very hard problem, it amounts to nothing if it comes at the wrong time or points in the wrong direction.

Continuously seeking feedback and alignment is crucial. Even when everything is aligned and synced, you should sometimes look up and try to see the forest from your own point of view.

A leader is not necessarily right all the time. Of course, when everything is uncertain, making quick decisions and iterating fast is essential, and the whole team should trust the direction the leader is pointing in. However, if you have strong evidence for changing a decision, challenge your leader on it, respectfully.

A handful of leaders will be frustrated by being challenged. That is an obvious sign that you are not in the right place. When that happens, it is time to seek another direction.

Role of AI&ML in forecasting

2025-08-25T00:00:00+00:00

When it comes to predicting or forecasting something, folks—including marketers, customers, and especially investors—really like to hear that a product is using Artificial Intelligence (AI). It’s well-proven that in many problem domains like image classification, detection, and language processing, AI, and especially deep learning, has outperformed many statistical models. However, when it comes to predicting the future (or time-series forecasting), that’s mostly not the case. This isn’t just from my own experience, but also from stories coming from multiple industries and companies, as I’ve exchanged insights with various engineers.

Throughout my startup journey, with successful (but very tough) fundraising rounds, I had to get through many investor talks and a lot of due diligence preparation. My dilemma was whether I had to lie about our technology for the sake of raising money, because investors want to hear about deep learning.

The Reality of AI/ML in Forecasting

While AI/ML models can be incredibly powerful, they’re not always a magical solution for forecasting. The reality is more nuanced, and often, their use comes with significant trade-offs compared to traditional statistical methods.

One of the primary challenges is the “black box” nature of many advanced AI models. It can be difficult to understand why a particular forecast was made, which can be a major issue when a business needs to justify a decision to stakeholders. Traditional models, on the other hand, are often more interpretable and transparent.

Another key issue is the resource intensiveness of AI/ML models. They are notoriously data-hungry, requiring vast amounts of historical data to train effectively. This can be a major barrier for businesses with limited data. Additionally, the computational power required to train these complex models is significantly higher, leading to increased costs and energy consumption.

Finally, while AI models can capture incredibly complex, non-linear relationships, this flexibility comes with a risk: overfitting. This occurs when the model learns from the random noise in the data rather than the true underlying patterns, leading to poor performance on new, unseen data.

The Power of Gradient Boosted Trees

It’s also crucial to mention another major player in the ML forecasting world that isn’t Deep Learning: Gradient Boosted Trees. Models like XGBoost and LightGBM are frequently the winning solutions in data science competitions and are widely used in practical industry applications. These models are a type of ensemble learning that builds a strong predictive model by combining many “weak” decision trees. They are often more interpretable than neural networks, require less data and tuning, and can be incredibly powerful and robust.

So I would introduce this approach not only as a middle ground between shallow and deep models but, more importantly, as a go-to solution as long as you have a minimum quantity of data. With the correct feature engineering (for example lagged values, rolling statistics, and calendar features), it can capture most of what statistical models such as ARIMA or exponential smoothing capture, namely trend and seasonality. (It approaches the problem in a completely different way mathematically, but it has the potential to offer similar results.) Plus, it captures non-linearity in the data and, with differencing or detrending of the target, can also cope with non-stationarity (trees on their own cannot extrapolate beyond the range seen in training).
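As an illustration of what "correct feature engineering" can mean here, the sketch below turns a daily series into tabular rows that a tree model could consume. The function name, the choice of lags (1 and 7 days), the 7-day window, and the toy series are all assumptions for the example, not a recipe from a specific project.

```python
from datetime import date, timedelta

def make_features(dates, values, lags=(1, 7), window=7):
    """Build one feature row per day that has enough history.

    Each row uses only past values, so nothing leaks from the future.
    """
    rows, targets = [], []
    start = max(max(lags), window)
    for t in range(start, len(values)):
        past = values[t - window:t]
        row = {f"lag_{k}": values[t - k] for k in lags}
        row["rolling_mean"] = sum(past) / window   # proxy for the local level
        row["day_of_week"] = dates[t].weekday()    # weekly seasonality
        row["month"] = dates[t].month              # yearly seasonality
        rows.append(row)
        targets.append(values[t])
    return rows, targets

# Hypothetical toy series: 14 days with a bump every Monday.
dates = [date(2025, 1, 6) + timedelta(days=i) for i in range(14)]
values = [20.0 if d.weekday() == 0 else 10.0 for d in dates]
rows, targets = make_features(dates, values)
print(rows[0], targets[0])
```

The lag and calendar columns are what let a tree pick up seasonality; the rolling mean gives it a rough sense of the current level.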

A Prudent Approach to Forecasting

This isn’t to say that AI/ML has no place in forecasting. It’s an incredibly useful tool when applied to the right problem, especially in situations with complex, high-dimensional data or when forecasting a large number of related time series. In many business cases, however, data is insufficient because the company has barely started collecting it, or part of it is corrupted or intermittent.

Overall tip from me: the key is to view AI/ML not as a universal solution but as one option among many. For many forecasting challenges, a classic statistical model might be more cost-effective, easier to interpret, and just as accurate as a complex AI model. The most successful approach often involves a combination of techniques, using the strengths of each to build a robust and reliable forecasting system. The goal isn’t to use AI just for the sake of it, but to find the best tool for the job. Investigate and experiment with different levels of aggregation and different models.

Statistical Modeling and Objective function

2025-08-18T00:00:00+00:00

Through forecasting projects, I’ve learned a crucial lesson that’s often overlooked in introductory courses: your model is only as good as its loss function, and the default choice is often the wrong one.

We all start by learning to use Mean Squared Error (MSE) for our regression tasks. It’s simple, intuitive, and the default in most libraries. But blindly applying it to every problem is like trying to use a hammer for every job in a toolbox. For many real-world industrial cases, it’s a critical mistake. Let’s explore why.

The Default: MSE and its Hidden Assumption

When you train a Gradient Boosting model (or almost any regression model) using Mean Squared Error (MSE) as the loss function, you’re doing more than just penalizing large errors. You are implicitly making a powerful statistical assumption: that the errors of your model will follow a Normal (or Gaussian) distribution.

Figure: Normal distribution. Symmetrical, continuous, and defined for all real numbers. MSE assumes residuals follow this shape.

This assumption works perfectly fine for certain problems. In my work on energy demand forecasting, the target variable (megawatt-hours) is continuous and typically high-volume. The demand curve is smooth, and the errors tend to cluster symmetrically around zero. In this scenario, the Normal distribution is a reasonable approximation of reality, and MSE is an excellent choice. It aggressively penalizes large deviations, pushing the model to be accurate for this high-stakes task.

But what happens when the data doesn’t look like a nice, continuous bell curve?

When the Default Fails: The World of Count Data

Now, let’s consider projects on inventory demand and strawberry yield forecasting. Here, the target variable is fundamentally different:

You can’t sell 2.43 spare parts.
A farmer counts 5 ripe strawberries, not 5.7.

This is count data. It’s discrete, non-negative, and often full of zeros. This is especially true for intermittent demand in inventory management, where a specific part might not be sold for days or weeks, followed by a sudden sale.

If you use MSE on this type of data, you’re asking your model to play by the wrong rules. The model might predict negative sales (-0.5 units) or fractional demand (1.25 units), which is nonsensical in the real world. This mismatch between the model’s assumption (Normal distribution) and the data’s reality (a discrete count process) leads to poor performance and unreliable forecasts.

However, here’s a catch: pragmatically, MSE still works quite well on count data when demand is continuous (regular and high-volume) rather than intermittent.

The Right Tools for the Job: Poisson & Negative Binomial

To correctly model count data, we need loss functions derived from distributions designed for counts. This is where the Poisson and Negative Binomial distributions come in.

Poisson: The Go-To for Simple Counts

The Poisson distribution models the probability of a number of events happening in a fixed interval. It’s perfect for data that consists of non-negative integer counts.

Figure: Poisson distribution. Discrete, non-negative, and defined only for integers. The “spikey” nature matches real-world counts.

By setting the objective function in my GBDT to poisson, I’m no longer just predicting a value; I’m telling the model to predict the rate ($\lambda$) of an event. I’m aligning the model’s objective with the data’s true nature. This immediately prevents predictions of negative values and grounds the model in the reality of a counting process.
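A minimal sketch of what a Poisson objective computes, assuming the usual log link: the model outputs a raw score $f$ and the rate is $\lambda = e^f$. The function names are illustrative; in practice libraries expose this as a named objective, for example `count:poisson` in XGBoost or `poisson` in LightGBM.

```python
import math

def poisson_nll(y: float, f: float) -> float:
    """Negative log-likelihood of count y when the rate is exp(f)."""
    lam = math.exp(f)
    return lam - y * f + math.lgamma(y + 1.0)

def poisson_grad_hess(y: float, f: float):
    """First and second derivative of the NLL with respect to the raw score f."""
    lam = math.exp(f)
    return lam - y, lam

# A negative raw score still maps to a positive predicted rate.
raw_score = -1.5
print(math.exp(raw_score))

# The gradient vanishes when the predicted rate equals the observed count.
g, h = poisson_grad_hess(y=3.0, f=math.log(3.0))
print(g, h)
```

Because the rate is an exponential of the raw score, the prediction can never be negative, which is the mechanism behind the claim above.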

Negative Binomial: For Messy, Real-World Counts

The Poisson distribution has one strict limitation: it assumes the mean and the variance of the data are equal. In my experience, real-world data is rarely this well-behaved.

Take inventory demand, for example. The average demand for a part might be 0.2 units per day, but when a sale occurs, it could be for 5 units. This “lumpy” demand is highly variable. This phenomenon, where the variance is much larger than the mean, is called over-dispersion.

This is where the Negative Binomial distribution shines. It’s a count data distribution like Poisson, but it has an extra parameter that allows the variance to be greater than the mean.
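A quick way to check for over-dispersion before choosing between the two is to compare the variance with the mean. The sketch below uses a hypothetical 25-day series built to match the example above (an average of 0.2 units per day, with one sale of 5 units); the helper name is an assumption.

```python
from statistics import mean, pvariance

def dispersion_ratio(counts):
    """Variance-to-mean ratio: about 1 for Poisson-like data, well above 1 when over-dispersed."""
    m = mean(counts)
    if m == 0:
        raise ValueError("an all-zero series has no defined dispersion ratio")
    return pvariance(counts) / m

# Hypothetical daily sales of one spare part: 24 days of zero and one order of 5 units.
lumpy = [0] * 12 + [5] + [0] * 12
print(mean(lumpy), dispersion_ratio(lumpy))
```

Here the variance is several times the mean, so the equal mean and variance that Poisson assumes does not hold for this series.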

Figure: Negative Binomial distribution. Handles over-dispersion gracefully; note the heavier tail compared to Poisson.

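Using a Negative Binomial loss function, or the tweedie objective that many libraries offer as a practical stand-in (Tweedie is a related but distinct compound Poisson-Gamma family, not a Negative Binomial), gives the model the flexibility to handle this spiky, over-dispersed data, leading to far more accurate and stable forecasts for lumpy demand patterns.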

The Connection Between Objective Functions and Statistical Modeling via MLE

At the heart of every machine learning loss function lies a statistical distribution assumption. This connection is formalized through Maximum Likelihood Estimation (MLE), a fundamental principle that bridges probability theory and optimization.

1. The Principle of Maximum Likelihood Estimation (MLE)

MLE answers the question: “Given observed data, what model parameters make this data most probable?”

Likelihood Function: For a dataset $\{y_i\}_{i=1}^n$ and predictions $\{\hat{y}_i\}_{i=1}^n$, the likelihood under a distribution $P$ is:
$L(\theta) = \prod_{i=1}^n P(y_i | \theta, \hat{y}_i)$
Log-Likelihood: Converting products to sums for numerical stability:
$\log L(\theta) = \sum_{i=1}^n \log P(y_i | \theta, \hat{y}_i)$
Negative Log-Likelihood (NLL): Flipped to a minimization problem:
$\text{NLL}(\theta) = -\sum_{i=1}^n \log P(y_i | \theta, \hat{y}_i)$
This NLL becomes the loss function we minimize during training.

2. From Distributions to Loss Functions

The choice of distribution $P$ directly determines the loss function:

Distribution	Loss Function	Use Case
Normal (Gaussian)	MSE	Continuous, symmetric errors
Poisson	Poisson loss	Simple counts (mean ≈ variance)
Negative Binomial	NBinom/Tweedie loss	Over-dispersed counts
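
As a worked check of the first two rows (a sketch, assuming the Normal noise scale $\sigma$ is a fixed constant and the Poisson rate equals the prediction, $\lambda_i = \hat{y}_i$; neither assumption is stated above), the per-sample negative log-likelihoods are:

\[-\log P(y_i \mid \hat{y}_i) = \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2) \quad \text{(Normal)}\]

\[-\log P(y_i \mid \hat{y}_i) = \hat{y}_i - y_i \log \hat{y}_i + \log(y_i!) \quad \text{(Poisson)}\]

In the Normal case, the second term does not depend on $\hat{y}_i$, so minimizing the NLL is the same as minimizing $\sum_i (y_i - \hat{y}_i)^2$, which is MSE up to a constant factor. In the Poisson case, dropping $\log(y_i!)$ leaves $\hat{y}_i - y_i \log \hat{y}_i$, the usual Poisson loss.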

3. Why Does This Matter?

Correctness: Using MSE for count data violates the Normal assumption, leading to biased predictions (e.g., negative counts).
Efficiency: When the loss matches the true data distribution, the estimates are statistically more efficient, and training often converges faster.
Interpretability: A Poisson/Negative Binomial loss directly models count rates, aligning with business logic (e.g., “units sold per day”).

Key Takeaway

Your loss function is a distributional assumption in disguise. Choose it based on the data’s statistical nature, not convenience. MLE provides the mathematical framework to derive the right loss function for your problem.

Understanding XGBoost: A Deep Dive into the Mathematics and From-Scratch Implementation

2025-08-06T00:00:00+00:00

Note: This post was initially drafted with deep research and finalized by me.

In my own experience with forecasting and predictive modeling in the agriculture, automotive, and e-commerce industries, and well beyond it, XGBoost (Extreme Gradient Boosting) is one of the go-to algorithms in production-grade solutions, renowned for its performance on structured data and its competition-winning results. While it is widely used, its inner mechanics, especially the interplay between second-order optimization, regularization, and tree construction, are often treated as a black box by many ML peers.

In this post, I deconstruct XGBoost from first principles, deriving its core objective function, explaining its structural and weight-based regularization, and implementing a simplified version of the exact greedy algorithm in pure Python using only the standard library. This is intended for readers with a strong mathematical background who seek not just intuition, but rigorous understanding.

1. What Is XGBoost? The Additive Model Framework

XGBoost builds predictions using an ensemble of $K$ regression trees $f_k$, combined additively:

\[\hat{y}_i = \sum_{k=1}^K f_k(x_i),\]

where:

$x_i \in \mathbb{R}^m$ is the $i$-th input instance (a data point, like a customer record or house listing),
$\hat{y}_i$ is the predicted output (e.g., price, probability),
Each $f_k$ is a decision tree that maps inputs to real-valued scores.

At each boosting round $t$, the model updates its prediction:

\[\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i).\]

The new tree $f_t$ is trained to correct the errors (residuals) of the previous model.
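The additive update can be sketched in a few lines. The two "trees" below are hand-written stumps with made-up thresholds and outputs, standing in for learned trees; only the summation structure is the point.

```python
from typing import Callable, List, Sequence

Tree = Callable[[Sequence[float]], float]  # maps one instance x_i to a score

def predict(trees: List[Tree], x: Sequence[float]) -> float:
    """Additive model: y_hat = sum_k f_k(x)."""
    return sum(f(x) for f in trees)

# Two hypothetical stumps on feature 0.
f1: Tree = lambda x: 300_000.0 if x[0] > 1500 else 200_000.0
f2: Tree = lambda x: 80_000.0 if x[0] > 1800 else 40_000.0

x = [2000.0]      # one instance
y = 400_000.0     # its true target
y_hat_1 = predict([f1], x)       # prediction after round 1
y_hat_2 = y_hat_1 + f2(x)        # round 2 adds f_2 on top of round 1
print(y - y_hat_1, y - y_hat_2)  # the residual shrinks from round 1 to round 2
```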

🔍 What Is an Instance?

An instance is simply a single row in your dataset—a specific example.

For example, if you’re predicting house prices:

Instance 1: [bedrooms=3, sqft=2000, year=2005] → price = $400,000
Instance 2: [bedrooms=2, sqft=1200, year=1998] → price = $250,000

So if you have $n$ rows, you have $n$ instances: $x_1, x_2, \dots, x_n$. Each $x_i \in \mathbb{R}^m$ is a vector of $m$ numerical or encoded features.

🔍 What Is a Leaf in a Decision Tree?

A decision tree works like a flowchart. It asks a series of yes/no questions (e.g., “Is size > 1500?”) and routes each instance down a path until it reaches a terminal node, called a leaf.

Each leaf acts like a “bucket” that collects similar instances and assigns them a prediction value, called the leaf weight, $w_j$.

If a tree has $T$ leaves, then every instance ends up in exactly one leaf.

👉 So the tree doesn’t give a unique output for every instance—it groups similar instances and gives them the same score.

This is why decision trees are piecewise constant functions: they partition the input space and assign a constant value to each region.

2. The XGBoost Objective: Balancing Fit and Complexity

The goal at step $t$ is to learn $f_t$ that minimizes the total objective:

\[\text{obj}^{(t)} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t),\]

where:

$l(\cdot)$: twice-differentiable convex loss (e.g., squared error, logistic loss),
$\Omega(f_t)$: regularization term penalizing tree complexity.

XGBoost uses:

\[\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2,\]

with:

$T$: number of leaves,
$w_j$: score (leaf weight) at leaf $j$,
$\gamma \geq 0$: penalty per leaf (controls tree size),
$\lambda \geq 0$: L2 penalty on leaf weights (controls overfitting).

This dual penalty discourages both trees with many leaves (via $\gamma$) and large leaf outputs (via $\lambda$).
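A small sketch of the penalty, with made-up leaf weights, shows how both terms grow with tree size and weight magnitude. The function name and the two example trees are hypothetical.

```python
def omega(leaf_weights, gamma: float, reg_lambda: float) -> float:
    """Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    T = len(leaf_weights)
    return gamma * T + 0.5 * reg_lambda * sum(w * w for w in leaf_weights)

small_tree = [0.5, -0.5]               # 2 leaves, small weights
large_tree = [2.0, -1.0, 0.5, -0.5]    # 4 leaves, larger weights
print(omega(small_tree, gamma=1.0, reg_lambda=1.0))   # 2.25
print(omega(large_tree, gamma=1.0, reg_lambda=1.0))   # 6.75
```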

3. Second-Order Taylor Expansion: Making the Objective Tractable

Since $f_t(x_i)$ is piecewise constant and non-differentiable, we can’t directly optimize the objective. Instead, XGBoost uses a second-order Taylor expansion of the loss $l$ around the current prediction $\hat{y}_i^{(t-1)}$:

\[l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) \approx l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2,\]

where:

$g_i = \partial_{\hat{y}} l(y_i, \hat{y}_i^{(t-1)})$ : first derivative (gradient),
$h_i = \partial^2_{\hat{y}} l(y_i, \hat{y}_i^{(t-1)})$ : second derivative (Hessian).

Dropping constants, the working objective becomes:

\[\text{obj}^{(t)} \approx \sum_{i=1}^n \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t).\]

This is now a quadratic function in $f_t(x_i)$, which we can optimize efficiently.
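To see the approximation at work, the sketch below computes $g_i$ and $h_i$ for two common losses and compares the Taylor estimate with the true loss after a step. It assumes squared error is written as $\frac{1}{2}(y - \hat{y})^2$ and that the logistic loss takes a raw (pre-sigmoid) score; all function names are illustrative.

```python
import math

def squared_error(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def squared_error_grad_hess(y, y_hat):
    return y_hat - y, 1.0

def logistic_loss(y, score):
    """y in {0, 1}; score is the raw (pre-sigmoid) prediction."""
    return math.log1p(math.exp(score)) - y * score

def logistic_grad_hess(y, score):
    p = 1.0 / (1.0 + math.exp(-score))
    return p - y, p * (1.0 - p)

def taylor(loss, grad_hess, y, y_hat, step):
    """Second-order approximation of loss(y, y_hat + step)."""
    g, h = grad_hess(y, y_hat)
    return loss(y, y_hat) + g * step + 0.5 * h * step ** 2

# Exact for squared error, because that loss is already quadratic.
print(squared_error(3.0, 1.5), taylor(squared_error, squared_error_grad_hess, 3.0, 1.0, 0.5))
# A close approximation for logistic loss when the step is small.
print(logistic_loss(1.0, 0.3), taylor(logistic_loss, logistic_grad_hess, 1.0, 0.2, 0.1))
```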

4. Tree Structure and Leaf-Wise Optimization

Let’s define:

$q: \mathbb{R}^m \to \{1,\dots,T\}$: the tree structure, a function that assigns each input $x_i$ to a leaf $j$.
$I_j = \{ i \mid q(x_i) = j \}$: the set of instances that fall into leaf $j$.

Then, by definition: $f_t(x_i) = w_j \quad \text{for all } i \in I_j.$

That is, all instances in the same leaf get the same prediction from this tree: $w_j$.

🧮 Rewriting the Objective by Leaves

Instead of summing over instances one by one, we can group them by leaf:

\[\sum_{i=1}^n g_i f_t(x_i) = \sum_{j=1}^T \sum_{i \in I_j} g_i w_j = \sum_{j=1}^T \left( \sum_{i \in I_j} g_i \right) w_j\]

\[\sum_{i=1}^n \frac{1}{2} h_i f_t(x_i)^2 = \sum_{j=1}^T \sum_{i \in I_j} \frac{1}{2} h_i w_j^2 = \sum_{j=1}^T \frac{1}{2} \left( \sum_{i \in I_j} h_i \right) w_j^2\]

Now include the regularization $\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2$, and combine:

\[\text{obj}^{(t)} = \sum_{j=1}^T \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T\]

Let:

$G_j = \sum_{i \in I_j} g_i$: total gradient in leaf $j$,
$H_j = \sum_{i \in I_j} h_i$: total Hessian in leaf $j$.

Then: $\text{obj}^{(t)} = \sum_{j=1}^T \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T$

✅ Optimal Leaf Weight

This expression is a sum of independent quadratics in $w_j$, so we can minimize each term separately.

Take the derivative w.r.t. $w_j$ and set it to zero:

\[G_j + (H_j + \lambda) w_j = 0 \quad \Rightarrow \quad w_j^* = -\frac{G_j}{H_j + \lambda}\]

This is the best possible value for the leaf score $w_j$ , given which instances are in that leaf.

📉 Optimal Objective Value

Substitute $w_j^*$ back in:

\[\text{obj}^* = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T\]

This value depends only on how the tree partitions the data (i.e., which $I_j$ sets are formed). It serves as a score for evaluating tree quality: lower is better.
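A worked example makes the two formulas tangible. The sketch assumes squared error loss (so $g_i = \hat{y}_i - y_i$ and $h_i = 1$) and a hypothetical leaf whose three residuals $y_i - \hat{y}_i$ are 2, 4, and 6; the function names are illustrative.

```python
def leaf_weight(G: float, H: float, reg_lambda: float) -> float:
    """w* = -G / (H + lambda)."""
    return -G / (H + reg_lambda)

def structure_score(leaves, gamma: float, reg_lambda: float) -> float:
    """obj* = -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T, with leaves = [(G_j, H_j), ...]."""
    fit = sum(G * G / (H + reg_lambda) for G, H in leaves)
    return -0.5 * fit + gamma * len(leaves)

# Residuals 2, 4, 6 give gradients -2, -4, -6, so G = -12 and H = 3.
G, H = -12.0, 3.0
print(leaf_weight(G, H, reg_lambda=0.0))   # 4.0, the mean residual
print(leaf_weight(G, H, reg_lambda=1.0))   # 3.0, shrunk toward zero
print(structure_score([(G, H)], gamma=0.0, reg_lambda=1.0))  # -18.0
```

With $\lambda = 0$ the optimal weight is simply the mean residual in the leaf; increasing $\lambda$ pulls it toward zero.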

5. Split Evaluation: The Gain Formula

To grow the tree, we ask: “Should we split this node?” and “How?”

Suppose a node contains instance set $I$, and we consider splitting it into left ($I_L$) and right ($I_R$) children.

Let:

$G = \sum_{i \in I} g_i$, $H = \sum_{i \in I} h_i$,
$G_L = \sum_{i \in I_L} g_i$, $H_L = \sum_{i \in I_L} h_i$,
$G_R = G - G_L$, $H_R = H - H_L$.

The gain from this split is the reduction in objective:

\[\text{Gain} = \underbrace{\left[ -\frac{1}{2} \frac{G^2}{H + \lambda} + \gamma \right]}_{\text{before split (one leaf)}} - \underbrace{\left[ -\frac{1}{2} \frac{G_L^2}{H_L + \lambda} -\frac{1}{2} \frac{G_R^2}{H_R + \lambda} + 2\gamma \right]}_{\text{after split (two leaves)}}\]

Simplifying:

\[\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right] - \gamma\]

🧠 Interpretation

The first two terms: improvement from having two specialized leaves.
The third term: the score of the original (unsplit) leaf, which the two children have to beat.
$-\gamma$ : penalty for adding a new leaf.

✅ A split is accepted only if $\text{Gain} > 0$. This acts as pre-pruning: only splits that improve the regularized objective are made.

🔥 Note: The original paper uses $-\gamma$, not $-\gamma/2$. Some implementations mistakenly halve the penalty, weakening regularization.
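The gain formula translates directly into code. The numbers below are hypothetical (two instances per child with $h_i = 1$, gradients summing to $-6$ on the left and $+6$ on the right) and the function names are illustrative.

```python
def leaf_score(G: float, H: float, reg_lambda: float) -> float:
    """G^2 / (H + lambda), the building block of the gain."""
    return G * G / (H + reg_lambda)

def split_gain(G_L, H_L, G_R, H_R, gamma: float, reg_lambda: float) -> float:
    G, H = G_L + G_R, H_L + H_R
    return 0.5 * (leaf_score(G_L, H_L, reg_lambda)
                  + leaf_score(G_R, H_R, reg_lambda)
                  - leaf_score(G, H, reg_lambda)) - gamma

print(split_gain(-6.0, 2.0, 6.0, 2.0, gamma=0.0, reg_lambda=1.0))   # 12.0
print(split_gain(-6.0, 2.0, 6.0, 2.0, gamma=15.0, reg_lambda=1.0))  # -3.0, split rejected
```

The parent's gradients cancel out ($G = 0$), so it has nothing to fit, while each child has a clear direction. That is exactly the situation where a split pays off, unless $\gamma$ is set high enough to veto it.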

6. The Exact Greedy Algorithm: How Splits Are Found

The exact greedy algorithm finds the best split by:

For each feature, sort instances by feature value.
For each adjacent pair $(x_i, x_{i+1})$ of distinct sorted values, consider a candidate split at $\frac{x_i + x_{i+1}}{2}$.
For each candidate, compute the gain.
Choose the split with maximum gain (if positive).

This ensures all possible binary splits are evaluated.
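The four steps above can be sketched for a single feature as follows. This is an illustrative scan, not the library's implementation; the function name, the default parameters, and the toy data are assumptions. It keeps running left-side sums so that each candidate costs constant time after the sort.

```python
def best_split_1d(x, g, h, gamma=0.0, reg_lambda=1.0):
    """Scan one feature; return (best_gain, best_threshold), or (None, None) if no split helps."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    G, H = sum(g), sum(h)
    G_L = H_L = 0.0
    best_gain, best_thr = None, None
    for a, b in zip(order, order[1:]):
        G_L += g[a]
        H_L += h[a]
        if x[a] == x[b]:
            continue  # identical values cannot be separated by a threshold
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + reg_lambda)
                      + G_R ** 2 / (H_R + reg_lambda)
                      - G ** 2 / (H + reg_lambda)) - gamma
        if gain > 0 and (best_gain is None or gain > best_gain):
            best_gain, best_thr = gain, (x[a] + x[b]) / 2
    return best_gain, best_thr

# Hypothetical data: squared error with y_hat = 0, so g_i = -y_i and h_i = 1.
x = [3.0, 1.0, 4.0, 2.0]
y = [10.0, 1.0, 11.0, 2.0]
g = [-v for v in y]
h = [1.0] * len(y)
print(best_split_1d(x, g, h))
```

On this toy data the scan picks the threshold 2.5, which separates the two small targets from the two large ones.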

🌲 Handling Ordinal Features

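XGBoost treats integer-encoded ordinal features (e.g., education level 1–8) as continuous. It generates splits at midpoints like 1.5, 2.5, etc., so every split respects the encoded order. This is efficient but relies on that order being meaningful: isolating a group of levels in the middle of the range takes at least two splits, so strongly non-monotonic relationships are captured less efficiently.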

⏱️ Computational Complexity

Per node: $O(d \cdot n \log n)$ , due to sorting $d$ features. This is expensive for large datasets, motivating approximate methods (e.g., quantile sketching).

🛑 Split Validity: `min_child_weight`

A split is invalid if: $H_L < \text{min\_child\_weight} \quad \text{or} \quad H_R < \text{min\_child\_weight}.$

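This ensures sufficient data and curvature in each child. In classification, $h_i = p_i(1 - p_i)$ peaks at $0.25$ when $p_i = 0.5$ and tends to $0$ as predictions become confident ($p_i \to 0$ or $1$ ), so this prevents further splits on small groups or on groups the model already predicts with near certainty.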
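To make the Hessian argument concrete, here is a small sketch for logistic loss. The function names and the example margins are assumptions chosen for illustration.

```python
import math
from typing import List


def logistic_hessian(margin: float) -> float:
    """Hessian of logistic loss w.r.t. the raw score: p * (1 - p)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p * (1.0 - p)


def child_is_valid(hessians: List[float], min_child_weight: float = 1.0) -> bool:
    """A child passes the check when its Hessian sum reaches min_child_weight."""
    return sum(hessians) >= min_child_weight


# Five instances the model is unsure about (margin 0, so p = 0.5, h = 0.25 each).
uncertain = [logistic_hessian(0.0)] * 5
# Five instances the model is already confident about (margin 6, p close to 1).
confident = [logistic_hessian(6.0)] * 5

print(sum(uncertain), child_is_valid(uncertain))
print(sum(confident), child_is_valid(confident))
```

Both groups have the same number of rows, but only the uncertain group carries enough Hessian mass to form a child on its own.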

7. From-Scratch Python Implementation (Standard Library Only)

Here’s a minimal, pure-Python implementation of the exact greedy algorithm.

from typing import List, Tuple, Optional

class TreeNode:
    def __init__(self):
        self.left: Optional['TreeNode'] = None
        self.right: Optional['TreeNode'] = None
        self.feature: Optional[int] = None
        self.threshold: Optional[float] = None
        self.value: Optional[float] = None  # leaf output

class TreeBooster:
    def __init__(self, max_depth: int = 3, min_child_weight: float = 1.0,
                 gamma: float = 0.0, reg_lambda: float = 1.0):
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.reg_lambda = reg_lambda
        self.root: Optional[TreeNode] = None

    def _compute_gradients(self, y_true: float, y_pred: float) -> Tuple[float, float]:
        # Example: squared error loss
        grad = y_pred - y_true
        hess = 1.0  # hessian = 1 for squared loss
        return grad, hess

    def _find_better_split(self, X: List[List[float]], y_true: List[float], y_pred: List[float],
                           idxs: List[int]) -> Tuple[Optional[int], Optional[float], Optional[float]]:
        best_gain = -float('inf')
        best_feature = None
        best_threshold = None
        best_value = None

        # Compute total G and H for current node
        G = sum(self._compute_gradients(y_true[i], y_pred[i])[0] for i in idxs)
        H = sum(self._compute_gradients(y_true[i], y_pred[i])[1] for i in idxs)

        if H < self.min_child_weight:
            return None, None, None

        n_features = len(X[0])
        for feature_idx in range(n_features):
            # Sort indices by feature value
            sorted_idxs = sorted(idxs, key=lambda i: X[i][feature_idx])
            values = [X[i][feature_idx] for i in sorted_idxs]

            GL, HL = 0.0, 0.0
            GR, HR = G, H

            for i in range(len(sorted_idxs) - 1):
                idx = sorted_idxs[i]
                g, h = self._compute_gradients(y_true[idx], y_pred[idx])
                GL += g; HL += h
                GR -= g; HR -= h

                # Avoid duplicate values
                if values[i] == values[i + 1]:
                    continue

                threshold = (values[i] + values[i + 1]) / 2.0

                # Check child weight
                if HL < self.min_child_weight or HR < self.min_child_weight:
                    continue

                gain = 0.5 * (
                    (GL**2 / (HL + self.reg_lambda)) +
                    (GR**2 / (HR + self.reg_lambda)) -
                    ((GL + GR)**2 / (HL + HR + self.reg_lambda))
                ) - self.gamma

                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold
                    best_value = -GL / (HL + self.reg_lambda)

        if best_gain <= 0:
            return None, None, None
        return best_feature, best_threshold, best_value

    def _build_tree(self, X: List[List[float]], y_true: List[float], y_pred: List[float],
                    idxs: List[int], depth: int) -> TreeNode:
        node = TreeNode()

        if depth >= self.max_depth:
            G = sum(self._compute_gradients(y_true[i], y_pred[i])[0] for i in idxs)
            H = sum(self._compute_gradients(y_true[i], y_pred[i])[1] for i in idxs)
            node.value = -G / (H + self.reg_lambda)
            return node

        feature, threshold, value = self._find_better_split(X, y_true, y_pred, idxs)

        if feature is None:
            G = sum(self._compute_gradients(y_true[i], y_pred[i])[0] for i in idxs)
            H = sum(self._compute_gradients(y_true[i], y_pred[i])[1] for i in idxs)
            node.value = -G / (H + self.reg_lambda)
            return node

        node.feature = feature
        node.threshold = threshold

        left_idxs = [i for i in idxs if X[i][feature] <= threshold]
        right_idxs = [i for i in idxs if X[i][feature] > threshold]

        node.left = self._build_tree(X, y_true, y_pred, left_idxs, depth + 1)
        node.right = self._build_tree(X, y_true, y_pred, right_idxs, depth + 1)

        return node

    def fit(self, X: List[List[float]], y: List[float], base_pred: List[float]):
        idxs = list(range(len(X)))
        self.root = self._build_tree(X, y, base_pred, idxs, 0)

    def _predict_row(self, row: List[float]) -> float:
        node = self.root
        while node.value is None:
            if row[node.feature] <= node.threshold:
                node = node.left
            else:
                node = node.right
        return node.value

    def predict(self, X: List[List[float]]) -> List[float]:
        return [self._predict_row(row) for row in X]

✅ Example Usage

X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]
base_pred = [0.0] * len(y)

model = TreeBooster(max_depth=2, reg_lambda=1.0)
model.fit(X, y, base_pred)
print(model.predict(X))  # [1.0, 4.5, 4.5, 4.5]: one tree's leaf weights, i.e. corrections to add to base_pred

⚠️ This is a pedagogical implementation of a single tree. Real XGBoost adds histogram-based split finding, sparsity-aware handling of missing values, and parallelism for speed.

8. Key Insights

Second-order optimization enables precise leaf weight estimation.
Dual regularization ($\gamma$ and $\lambda$ ) controls structure and output magnitude.
Gain-based splitting ensures only meaningful splits are made.
Leaf grouping allows closed-form optimization of $w_j$ .
Hessian-aware split constraints (`min_child_weight`) block splits on tiny groups and on groups the model already predicts confidently.

9. Conclusion

XGBoost’s power lies in its principled fusion of:

Gradient boosting,
Second-order optimization,
Structural and weight regularization.

By understanding how instances are grouped into leaves and how leaf weights are optimized, we move beyond treating XGBoost as a black box.

It’s not magic—it’s math, carefully engineered.

References
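[1] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16).
[2] Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189–1232.
[3] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.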

My Expert Forecasts Got Crushed by a Dumb Algorithm

2025-08-04T00:00:00+00:00

Note: This post is initially composed with deep research and finalized by me.

I have to admit something humbling. After spending months building what I thought were good forecasting models (MAPE less than 15%), I ran a check against about the simplest possible approach: a naive baseline (a plain moving average). The result? My expert judgment, backed by complex models and deep domain knowledge, was consistently less accurate. 🤦‍♂️

This wasn’t just a fluke. It was a classic “expert paradox,” an experience that forced me to re-evaluate how we approach forecasting. It turns out that our greatest asset—our expert brain—can also be our biggest liability, thanks to a host of cognitive biases. This post is my deep dive into why simple baselines are so powerful and how I’m changing my approach to forecasting for good.

The Underrated Power of Simple Baselines

Before we get into why my expert judgment failed, let’s talk about the simple, “dumb” methods that beat me. These are essential baselines that should be the starting point for any forecasting project. They are the ultimate test of whether your complex model is actually adding any value.

Naive (Random Walk) Forecast: This is the simplest of all. It assumes the forecast for every future period is just the last observed value. It’s surprisingly effective for many financial and economic time series.
- The logic: “Tomorrow will be the same as today.”
- The math: $\hat{y}_{T+h|T} = y_T$ (The forecast $\hat{y}$ for a future time $T+h$ is simply the value at the last observation $y_T$.)
Seasonal Naive Method: This is for data with clear seasonal patterns. The forecast is simply the value from the same season last year.
- The logic: “This April will be like last April.”
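- The math: $\hat{y}_{T+h|T} = y_{T+h-m(k+1)}$ , where $m$ is the seasonal period and $k = \lfloor (h-1)/m \rfloor$ counts the complete seasons inside the forecast horizon. (For $h \le m$ this is simply $y_{T+h-m}$ , the value from one seasonal period ago.)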
Average Method: Here, the forecast for all future periods is just the mean of all your historical data. It’s stable but doesn’t adapt to recent changes.
- The logic: “The future will be the average of the past.”
- The math: $\hat{y}_{T+h|T} = \bar{y} = \frac{1}{T}\sum_{t=1}^{T}y_t$

These methods are powerful because they are objective, transparent, and robust against overfitting. They don’t have egos or bad days. They mechanically execute a simple rule, which protects them from the very thing that tripped me up: cognitive bias.
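The three baselines above fit in a few lines of plain Python. This is a minimal sketch; the function names and the toy series are assumptions made for illustration.

```python
from typing import List


def naive_forecast(y: List[float], h: int) -> List[float]:
    """Repeat the last observed value for all h future periods."""
    return [y[-1]] * h


def seasonal_naive_forecast(y: List[float], h: int, m: int) -> List[float]:
    """Repeat the last full season (period m), wrapping around if h > m."""
    T = len(y)
    return [y[T - m + ((k - 1) % m)] for k in range(1, h + 1)]


def average_forecast(y: List[float], h: int) -> List[float]:
    """Use the mean of all history for every future period."""
    mean = sum(y) / len(y)
    return [mean] * h


# Two "years" of quarterly data (made-up numbers).
history = [10.0, 20.0, 30.0, 40.0, 12.0, 22.0, 32.0, 42.0]

print(naive_forecast(history, 3))                 # [42.0, 42.0, 42.0]
print(seasonal_naive_forecast(history, 6, m=4))   # [12.0, 22.0, 32.0, 42.0, 12.0, 22.0]
print(average_forecast(history, 2))               # [26.0, 26.0]
```

Any candidate model can then be scored on the same hold-out periods as these three and has to earn its complexity by beating them.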

Why My Brain Failed Me: The Peril of Cognitive Biases

So, why did my judgment perform so poorly? Because even with deep expertise, the human mind takes mental shortcuts (heuristics) that lead to systematic errors. A mechanical approach like the naive method is immune to these. Looking back, I could see these biases clearly in my own thinking.

Cognitive Bias	My Experience in a Nutshell	Impact on My Forecast
Anchoring Bias	I was stuck on an initial piece of information, like last quarter’s strong performance, and didn’t adjust enough for new data.	My forecasts were too slow to react to changing market conditions.
Overconfidence Bias	I was excessively confident in my own judgment, believing I could “outsmart” the market’s randomness. This made me downplay uncertainty.	My prediction intervals were way too narrow, and I was shocked by “unexpected” outcomes.
Confirmation Bias	I subconsciously looked for news articles and data points that confirmed my initial hypothesis, while ignoring contradictory evidence.	I created an echo chamber that reinforced my flawed assumptions.
Availability Heuristic	A recent, dramatic market event was fresh in my mind, causing me to overweight its importance for future predictions.	My forecasts overreacted to salient but not necessarily representative information.
Optimism / Wishful Thinking	Because I was invested in a project’s success, I was hesitant to forecast a negative outcome. My hopes clouded my realistic estimates.	My forecasts were consistently too optimistic, driven by what I wanted to happen.

These biases aren’t a sign of incompetence; they are a fundamental part of human cognition. My mistake wasn’t having them; it was failing to have a system in place to counteract them. The simple baseline won because a mechanical rule is, by its very nature, immune to these cognitive biases (even though it can still be statistically biased).

My New Forecasting Playbook: A Hybrid Approach

This humbling experience led me to adopt a new, more rigorous forecasting framework. It’s all about leveraging the strengths of both statistical methods and expert judgment while mitigating their weaknesses.

1. Baselines Are Mandatory

No complex model gets a pass without first proving it can consistently beat a simple baseline (like naive or seasonal naive). If your fancy ML model can’t outperform “tomorrow will be like today,” you have a problem. This is my non-negotiable first step.

2. Systematically Mitigate Bias

I can’t eliminate my cognitive biases, but I can tame them. My new process involves:

Writing down my assumptions before making a forecast.
Seeking out disconfirming evidence on purpose.
Separating the forecast from the target. A forecast is a realistic estimate, not a goal. These two things must be kept separate to avoid wishful thinking.

3. Combine, Don’t Just Choose

The best approach is often a hybrid one. I now use a simple statistical forecast as my objective anchor. Then, I use my expert judgment to make structured adjustments for things the model can’t possibly know—a new product launch, a planned marketing campaign, or a looming supply chain disruption. This combines the objectivity of the machine with the contextual intelligence of the human.

4. Prioritize Interpretability

A “black box” forecast that no one understands is a forecast no one trusts or acts on. Simple models are transparent by nature. For more complex models, I now put extra effort into explaining the “why” behind the numbers. An accurate forecast is useless if it’s not actionable.

Final Thoughts

Getting beaten by a simple algorithm was a crucial lesson in intellectual humility. It taught me that in the world of forecasting, complexity is not a virtue in itself. Objectivity and consistency often trump subjective expertise.

My advice now is simple: respect the power of simplicity. Start with a naive baseline, be ruthlessly honest about your own cognitive biases, and build a process that combines the best of both worlds—statistical rigor and expert insight. Your accuracy will thank you for it.

Forecasting Accuracy Metrics

2025-08-02T00:00:00+00:00

Note: This post is initially composed with deep research and finalized by me.

Beyond MAPE & RMSE: A Strategic Guide to Forecasting Metrics

Why Forecast Accuracy Matters More Than Ever

From Strawberry Fields to Energy Grids: My journey through agricultural yield forecasting at hexafarms (predicting strawberry and tomato yields 1-4 weeks ahead) and now demand forecasting in e-commerce and energy has taught me this: forecasting isn’t about crystal balls—it’s about quantifying uncertainty to drive better decisions. Whether it’s a farmer deciding harvest schedules, an e-commerce manager allocating inventory, or an energy trader balancing grids, the stakes of forecast accuracy are always tangible.

The Core Challenge: In agriculture, a 10% underforecast could mean rotting strawberries. In energy, it could mean emergency gas purchases at peak prices. Actual results will deviate—our job is to answer three questions:

Are deviations within expected bounds?
What do they reveal about our models?
How do we communicate uncertainty to decision-makers?

The Stakes Are Always Concrete:

Agricultural Forecasting: 1kg underprediction → €2 wasted strawberries; overprediction → missed €4/kg market premiums
E-Commerce: 10% demand misforecast → 15% excess inventory costs or 20% stockout losses
Energy: 5% load forecast error → €200k imbalance penalties per trading period

The Evolution I’ve Lived: Early in my career, we obsessed over single-point accuracy (MAPE for yields). Today, I advocate for probabilistic thinking—whether predicting the 90th percentile of tomato harvests or the distribution of next week’s EV charging demand. The breakthrough moment? When our team replaced “We’ll harvest 12 tons” with “80% chance of 10-14 tons” and reduced food waste by 23%.

Demystifying Point Forecast Metrics: A Taxonomy

No single metric fits all scenarios. Understanding their construction and properties is key to strategic selection. Metrics are built from three core components:

Point Distance (D): How is the error (Actual - Forecast) measured? (e.g., raw error, absolute error, squared error, log ratio).
Normalization (N): How is the error scaled? (e.g., not scaled, divided by actual, scaled by benchmark error).
Aggregation (G): How are individual errors combined? (e.g., mean, median, geometric mean, sum).

This framework explains why metrics behave as they do. Below is a comparative analysis of common metrics:

Metric (Abbrev)	Formula (Simplified)	Distance (D)	Normalization (N)	Aggregation (G)	Key Properties	Strengths	Weaknesses
Mean Abs Error (MAE)	`MAE = (1/n) * Σ |Actual - Forecast|`	Absolute (D2)	Unitary (N1)	Mean (G1)	Scale-dependent, preserves units, equal weighting.	Intuitive, easy to compute & interpret (avg error magnitude).	Not comparable across series on different scales; not differentiable at zero error.
Mean Sq Error (MSE)	`MSE = (1/n) * Σ (Actual - Forecast)²`	Squared (D3)	Unitary (N1)	Mean (G1)	Scale-dependent (squared units), emphasizes large errors.	Differentiable; penalizes large errors; good for model discrimination.	Highly sensitive to outliers; not comparable; units unintuitive.
Root MSE (RMSE)	`RMSE = √MSE`	Squared (D3)	Unitary (N1)	Mean (G1)	Scale-dependent (original units).	Brings MSE back to data scale; interpretable.	Highly sensitive to outliers; not comparable; criticized for cross-series use.
Mean Abs % Error (MAPE)	`MAPE = (100/n) * Σ |(Actual - Forecast)/Actual|`	Absolute (D2)	By Actuals (N2)	Mean (G1)	Scale-independent.	Popular; easy to interpret as % error.	Undefined if Actual=0; skewed near zero; asymmetric (penalizes over-forecasts more).
sMAPE	`sMAPE = (100/n) * Σ (2*|e|)/(|Actual|+|Forecast|)`	Absolute (D2)	Sum Actual+Pred (N4)	Mean (G1)	Scale-independent.	Attempts symmetry.	Undefined when Actual and Forecast are both zero; saturates at 200% whenever Actual is zero; still asymmetric in practice; variants without absolute values in the denominator can go negative.
Median Abs % Error (MdAPE)	`MdAPE = Median( |(Actual - Forecast)/Actual| * 100 )`	Absolute (D2)	By Actuals (N2)	Median (G2)	Scale-independent, median-based.	Robust to large outliers; isolates accuracy from bias; easy to calculate.	Less sensitive than mean; handling of zeros not always explicit.
Mean Abs Scaled Error (MASE)	`MASE = mean( |e_t| / Q )` Q = MAE of in-sample naïve forecast	Absolute (D2)	Variability (N3)	Mean (G1)	Scale-free, robust denominator.	Defined unless the in-sample series is constant; handles intermittent data; comparable across series; handles trend/seasonality.	Requires historical data for benchmark; interpretation needs context (Q).
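To see how these formulas differ on the same errors, here is a minimal standard-library sketch. The function signatures and the sample numbers are assumptions, and MASE here uses the non-seasonal one-step naive forecast as its benchmark.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], forecast: Sequence[float]) -> float:
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual: Sequence[float], forecast: Sequence[float]) -> float:
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    # Raises ZeroDivisionError if any actual value is zero.
    return 100.0 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    return 100.0 / len(actual) * sum(
        2.0 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))


def mase(actual: Sequence[float], forecast: Sequence[float],
         train: Sequence[float]) -> float:
    # Q: in-sample MAE of the one-step naive forecast (y_t predicted by y_{t-1}).
    q = sum(abs(train[t] - train[t - 1]) for t in range(1, len(train))) / (len(train) - 1)
    return mae(actual, forecast) / q


train = [90.0, 100.0, 120.0, 150.0]      # history used only for the MASE scale
actual = [100.0, 200.0, 300.0]
forecast = [110.0, 190.0, 330.0]

for name, value in [("MAE", mae(actual, forecast)),
                    ("RMSE", rmse(actual, forecast)),
                    ("MAPE", mape(actual, forecast)),
                    ("sMAPE", smape(actual, forecast)),
                    ("MASE", mase(actual, forecast, train))]:
    print(f"{name}: {value:.3f}")
```

RMSE comes out above MAE because the single 30-unit miss is squared before averaging, and a MASE below 1 means the forecast beat the in-sample naive benchmark on average.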

Understanding the Categories

Scale-Dependent Metrics (MAE, MSE, RMSE): Error is in the original data units. Pro: Intuitive magnitude. Con: Useless for comparing series on different scales (e.g., sales of pencils vs. trucks).
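Percentage-Error Metrics (MAPE, sMAPE, MdAPE): Error expressed as a percentage of actuals. Pro: Scale-independent, good for cross-series comparison. Con: Break down at zero actuals: MAPE is undefined, and sMAPE jumps to its 200% ceiling (or is undefined when the forecast is also zero). MAPE also favors under-forecasting, and sMAPE has several competing definitions.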
Relative-Error Metrics (MdRAE, GMRAE): Error relative to a benchmark (e.g., naïve forecast). Pro: Scale-independent. Con: Undefined if benchmark error is zero (common with intermittent data).
Scale-Free Metrics (MASE): Error scaled by in-sample benchmark error. Pro: Universally applicable, robust to zeros/intermittency, comparable. Gold standard for intermittent demand.

The Intermittent Demand Challenge

Sales of slow-moving items (e.g., specific lubricants) are characterized by many zeros and sporadic spikes. This wreaks havoc on traditional metrics:

MAPE/sMAPE: Explode (Actual = 0): MAPE divides by zero, and sMAPE hits 200% for any non-zero forecast.
GMAE: Becomes zero (Error = 0 anywhere).
Relative Metrics: Explode (Benchmark Error = 0).
MASE: Thrives. Its denominator (in-sample naïve MAE) is always available and non-zero (unless all historical data is identical, which is rare).

MASE is the unequivocal champion for intermittent demand forecasting evaluation.
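A tiny illustration of the failure mode, using a made-up slow-moving item (all numbers and helper names here are assumptions):

```python
from typing import Optional, Sequence


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    return 100.0 / len(actual) * sum(abs((a - f) / a) for a, f in zip(actual, forecast))


def mase(actual: Sequence[float], forecast: Sequence[float],
         train: Sequence[float]) -> float:
    # Scale: in-sample MAE of the one-step naive forecast.
    q = sum(abs(train[t] - train[t - 1]) for t in range(1, len(train))) / (len(train) - 1)
    abs_err = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return abs_err / q


train = [0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 2.0, 0.0]   # mostly zeros, sporadic demand
actual = [0.0, 0.0, 5.0, 0.0, 0.0, 3.0]
forecast = [1.0] * len(actual)                      # flat forecast of 1 unit

mape_value: Optional[float]
try:
    mape_value = mape(actual, forecast)
except ZeroDivisionError:
    mape_value = None                               # MAPE cannot be computed

mase_value = mase(actual, forecast, train)
print("MAPE:", mape_value)
print("MASE:", round(mase_value, 4))
```

MAPE fails on the very first zero, while MASE stays finite because its denominator comes from the training history rather than from each individual actual.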

The Myth of the “Best” Metric

The quest for a single, universally “best” metric is futile. Here’s why:

Different Goals: Minimizing average error (MAE) vs. avoiding large errors (RMSE) require different metrics.
Different Data: Continuous sales vs. intermittent demand call for different metrics (MASE over MAPE).
Different Perspectives: Metrics offer distinct “projections” of error characteristics (bias, variance, outliers).
Evolving Understanding: Dominant metrics have shifted over decades (MSE -> MAPE -> MASE).
Criticism Abounds: Even popular metrics (RMSE, MAPE) have well-documented flaws.
Error is Random: A single number cannot fully describe a random variable’s distribution.

Selection is strategic: Choose metrics based on business objective, data characteristics, and interpretability needs.

Handling Outliers & Distribution

Squared Errors (MSE/RMSE): Highly sensitive. Amplify large errors. Use if large errors are exceptionally costly.
Absolute Errors (MAE): Less sensitive. Equal weight to all errors.
Median-Based (MdAPE, MdAE): Most robust. Reflect typical performance, ignoring extremes.
Percentage Metrics (MAPE): Can be extremely skewed if actuals are near zero.

Leveling Up: Advanced Evaluation Techniques

Time Series Cross-Validation (Rolling Forecast Origin)

Forget simple train/test splits. Time series data demands chronological preservation:

Start with initial training window (e.g., first k observations).
Forecast next h periods.
Move origin forward by one period, add new observation to training set.
Repeat the forecast and roll-forward steps until the data runs out.
Aggregate out-of-sample errors across all origins.

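Why it’s critical: Simulates real-world forecasting. Exposes overfitting to historical patterns, because every error is measured on data the model has not yet seen. Provides a reliable estimate of future performance. (For models with Gaussian errors, minimizing AIC is asymptotically equivalent to minimizing the one-step MSE from this procedure.)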
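The rolling-origin procedure can be sketched as below. It uses an expanding window and a naive forecaster as a stand-in model; `rolling_origin_errors`, the forecaster signature, and the toy series are illustrative assumptions.

```python
from typing import Callable, List, Sequence

Forecaster = Callable[[Sequence[float], int], List[float]]


def rolling_origin_errors(y: Sequence[float], initial_window: int,
                          horizon: int, forecaster: Forecaster) -> List[float]:
    """Collect out-of-sample absolute errors over all forecast origins."""
    errors: List[float] = []
    # Each origin is the index of the first unseen observation.
    for origin in range(initial_window, len(y) - horizon + 1):
        train = y[:origin]                        # only data before the origin
        preds = forecaster(train, horizon)
        actuals = y[origin:origin + horizon]
        errors.extend(abs(a - p) for a, p in zip(actuals, preds))
    return errors


def naive(train: Sequence[float], h: int) -> List[float]:
    """Stand-in model: repeat the last observed value."""
    return [train[-1]] * h


series = [10.0, 12.0, 13.0, 15.0, 14.0, 16.0, 18.0, 17.0]
errs = rolling_origin_errors(series, initial_window=4, horizon=1, forecaster=naive)
cv_mae = sum(errs) / len(errs)
print(errs, cv_mae)   # [1.0, 2.0, 2.0, 1.0] 1.5
```

Swapping `naive` for any function with the same signature gives a like-for-like comparison on exactly the same origins.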

Evaluating Probabilistic Forecasts

Assessing a full distribution against a single observation (e.g., seasonal peak demand) is hard. Key qualities:

Calibration (Reliability): Does the forecast distribution match reality? (e.g., 90% prediction intervals should contain the actual ~90% of the time). Tested via uniformity of PIT (Probability Integral Transform) histograms.
Sharpness: How narrow are the prediction intervals? Tighter intervals indicate more confident/useful forecasts (if calibrated!).

Scoring Rules (Combine Calibration & Sharpness): Numerical measures rewarding accurate distributions. “Proper” scoring rules incentivize truthful forecasts.

Continuous Ranked Probability Score (CRPS): Generalizes MAE. Integral of squared difference between forecast CDF and step function at observation. Lower = Better.
Dawid-Sebastiani Score (DSS): Generalizes MSE. Assumes normality.
Energy Score (ES), Log Score: Other proper scores.
Empirical Fit Metrics (for multiple observations):
- Mean Absolute Excess Probability (MAEP): Avg. absolute diff between proportion of obs exceeding a percentile and that percentile. → 0 for perfect fit.
- Kolmogorov-Smirnov (KS) Statistic: Max absolute diff between empirical and forecast CDF. → 0 for perfect fit. Scale-independent.
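As a rough illustration of these ideas, the sketch below estimates CRPS from forecast samples using the identity CRPS = E|X − y| − ½·E|X − X′| and checks empirical interval coverage as a simple calibration test. Function names and numbers are assumptions, and the O(n²) sample estimator is only meant for small ensembles.

```python
from typing import Sequence


def crps_from_samples(samples: Sequence[float], observed: float) -> float:
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    term_obs = sum(abs(x - observed) for x in samples) / n
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return term_obs - 0.5 * term_spread


def interval_coverage(lowers: Sequence[float], uppers: Sequence[float],
                      actuals: Sequence[float]) -> float:
    """Share of actuals that fall inside their prediction interval."""
    hits = sum(lo <= a <= hi for lo, hi, a in zip(lowers, uppers, actuals))
    return hits / len(actuals)


# A spread-out forecast centred on the outcome scores well (low CRPS).
crps_spread = crps_from_samples([1.0, 2.0, 3.0], observed=2.0)
# A degenerate "point" forecast: CRPS collapses to the absolute error.
crps_point = crps_from_samples([3.0, 3.0, 3.0], observed=2.0)

coverage = interval_coverage([8.0] * 4, [12.0] * 4, [9.0, 11.0, 13.0, 10.0])
print(crps_spread, crps_point, coverage)
```

The point-forecast case returns exactly |3 − 2| = 1, which is the sense in which CRPS generalizes MAE.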

Residual Analysis: The Diagnostic Powerhouse

Residuals (Observed - Fitted) are gold for model improvement. A good model’s residuals should be:

Uncorrelated: No lingering patterns (check ACF/PACF).
Zero Mean: No systematic bias.
Constant Variance (Homoscedasticity): Spread doesn’t change with level.
(Ideally) Normally Distributed: For reliable prediction intervals.
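The first two checks are easy to script. Below is a minimal sketch with hypothetical helper names; the ±1.96/√n band is the usual rule of thumb for white-noise autocorrelations, not a formal test.

```python
import math
from typing import Sequence


def residual_mean(residuals: Sequence[float]) -> float:
    return sum(residuals) / len(residuals)


def autocorrelation(residuals: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of the residuals at the given lag."""
    n = len(residuals)
    mean = residual_mean(residuals)
    denom = sum((r - mean) ** 2 for r in residuals)
    num = sum((residuals[t] - mean) * (residuals[t - lag] - mean)
              for t in range(lag, n))
    return num / denom


# Made-up residuals that flip sign every period: a pattern the model missed.
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

bias = residual_mean(residuals)
acf1 = autocorrelation(residuals, lag=1)
band = 1.96 / math.sqrt(len(residuals))

print(f"mean={bias:.3f}, lag-1 ACF={acf1:.3f}, band=+/-{band:.3f}")
print("lag-1 autocorrelation looks significant:", abs(acf1) > band)
```

Here the mean is zero, so there is no bias, yet the strong negative lag-1 autocorrelation shows the residuals are far from uncorrelated.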

Practical Use (e.g., AEMO):

Analyze residuals near forecast extremes (min/max demand).
Compare residuals generating simulated extremes vs. residuals from actual extremes.
Reveals where distributional assumptions break or model fails critically.

Lessons from the Trenches: AEMO & KPMG

Australian Energy Market Operator (AEMO)

AEMO forecasts critical energy metrics, facing intense scrutiny. Their practices highlight context-driven assessment:

Annual Consumption (Point Forecast): Uses Percentage Error (PE) for simplicity. Enhances communication via:
- Waterfall plots showing contribution of input errors (GSP, weather).
- Context on input error impact.
Min/Max Probabilistic Demand: Employs a multi-pronged approach:
- Qualitative Comparison (FAR): Reports where observed demand fell in forecast distribution + context. Vital for stakeholders.
- Probabilistic Drivers (FAR): Reports ranges of key drivers (temp, time) at simulated extremes. Recommendation: Visualize distributions overlaid with actuals.
- MAEP & KS (Internal PD): Core technical metrics for distribution fit.
- Pinball Loss (Relative Score - Internal): For comparing probabilistic models. Recommendation: Normalize by observed value, not forecast quantile.
- Discontinued: Backcasting (didn’t assess forecast accuracy).
- Recommended: Full-Season Hindcasting (compare forecast using actual historical inputs vs. historical forecast) & Simulated History (apply current model to past data).
- Residual Analysis (Internal PD): Formalize near extremes and simulated vs. actual extremes. Critical for diagnosing regression-simulation framework.
- Adaptability: Re-evaluate metrics if forecasting methodology changes.

Key Takeaway: Rigorous internal diagnostics (PD) differ from clear, contextual stakeholder reporting (FAR).
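For readers unfamiliar with the pinball loss mentioned in the list above, here is a minimal sketch of the standard definition. The function names and numbers are assumptions, and this is not AEMO's implementation.

```python
from typing import Sequence


def pinball_loss(actual: float, quantile_forecast: float, tau: float) -> float:
    """Quantile (pinball) loss for one observation at quantile level tau."""
    diff = actual - quantile_forecast
    # Under-forecasts are weighted by tau, over-forecasts by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_pinball_loss(actuals: Sequence[float], quantile_forecasts: Sequence[float],
                      tau: float) -> float:
    losses = [pinball_loss(a, q, tau) for a, q in zip(actuals, quantile_forecasts)]
    return sum(losses) / len(losses)


# A 90th-percentile forecast of 10 units (made-up values).
loss_when_exceeded = pinball_loss(actual=12.0, quantile_forecast=10.0, tau=0.9)
loss_when_below = pinball_loss(actual=8.0, quantile_forecast=10.0, tau=0.9)
print(loss_when_exceeded, loss_when_below)
```

For a high quantile such as 0.9, an actual above the forecast costs nine times as much as an equally sized miss below it, which is what pushes the optimal forecast up toward the 90th percentile.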

KPMG (Financial Forecasting)

Focuses on Prospective Financial Information (PFI) accuracy. Highlights benchmarking nuances:

Preferred Metric: Median Absolute Percentage Error (MdAPE). Robust to outliers, isolates accuracy from bias. (e.g., A large actual change might double MAPE but leave MdAPE stable).
Power of Benchmarking: Compare company MdAPE against peers.
- Hypothetical: Company MdAPE=7.5% vs. S&P 500 Median=4.0%, Lower Quartile=8.8% → Accuracy is below median but not in the worst quartile.
Benchmarking Caveats:
- Industry Specificity: S&P 500 too broad. Use industry-specific peers.
- Statistical Significance: Beware outliers in benchmarks.
- Temporal Context: Accuracy varies over time (e.g., COVID-19 massively increased MdAPE in 2020). Compare relevant periods.

Key Takeaway: A “good” MdAPE is relative to your industry and the economic climate.
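The robustness argument for MdAPE is easy to verify numerically. In this sketch (hypothetical numbers, not KPMG data) one large miss inflates MAPE while MdAPE barely notices.

```python
from statistics import mean, median
from typing import List, Sequence


def abs_pct_errors(actual: Sequence[float], forecast: Sequence[float]) -> List[float]:
    """Absolute percentage error for each period, in percent."""
    return [100.0 * abs((a - f) / a) for a, f in zip(actual, forecast)]


actual = [100.0, 100.0, 100.0, 100.0, 100.0]
forecast = [98.0, 103.0, 96.0, 105.0, 150.0]   # the last period is a large miss

apes = abs_pct_errors(actual, forecast)
mape_value = mean(apes)
mdape_value = median(apes)
print(f"MAPE={mape_value:.1f}%  MdAPE={mdape_value:.1f}%")
```

MAPE lands near 12.8% because of the single 50% error, while MdAPE stays near 4%, the typical miss.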

Strategic Recommendations & Conclusion

Choosing Your Metrics Wisely

Select metrics strategically, considering:

Factor	Considerations	Metric Examples
Cross-Series Comparison	Essential?	Scale-Independent: MASE, MAPE (if no zeros), sMAPE (cautiously), MdAPE, Relative Metrics
Zero Values / Intermittency	Present?	Robust: MASE (Champion)
Outlier Sensitivity	Are large errors catastrophic?	Penalize Large Errors: MSE, RMSE
	Should typical performance be assessed?	Robust to Outliers: MAE, MdAPE, MdAE
Detecting Bias	Is systematic over/under-forecasting a concern?	Bias Metrics: Mean Error (ME), Mean Percentage Error (MPE)
Stakeholder Interpretability	Need simple communication?	Intuitive: MAE, MAPE (if applicable), PE, MdAPE

Embrace Hybrid Metric Sets

No single metric tells the whole story. Combine complementary metrics to get a comprehensive view:

Example 1: MAE (Accuracy) + ME (Bias).
Example 2: MASE (Overall Robust Accuracy) + MdAPE (Robust Typical % Error).
Example 3 (Probabilistic): CRPS (Overall Prob. Score) + Calibration Plot + Sharpness Measure.

Research opportunities exist in identifying non-redundant, maximally informative hybrid sets.

Conclusion: A Framework for Confidence

Effective forecast evaluation is not about finding a magic number. It’s a strategic discipline requiring:

Understanding Metric Properties: Know how they are built (D, N, G) and how they behave.
Knowing Your Data: Is it intermittent? Prone to outliers? Seasonal? This dictates valid metrics.
Aligning with Business Goals: What type of error is most costly? What do stakeholders need to understand?
Leveraging Advanced Techniques: Use Time Series CV for reliable out-of-sample estimates. Employ scoring rules and calibration/sharpness checks for probabilistic forecasts. Harness residual analysis for diagnostics.
Learning from Practice: Adapt approaches like AEMO’s context-driven reporting or KPMG’s benchmarked MdAPE.
Using Hybrid Sets: Gain a multi-dimensional view of performance.

By adopting this rigorous, context-aware, and multi-faceted framework, organizations can move beyond simplistic accuracy measures, build genuine confidence in their forecasts, and make significantly better decisions in an uncertain world. Continuous evaluation drives continuous improvement.

The Linear Gaussian State Model: Dynamic Systems

2025-08-01T00:00:00+00:00

Note: This post is initially composed with deep research and finalized by me.

1. Introduction to State-Space Models

State-space models offer a robust and flexible framework for analyzing and understanding dynamic systems, particularly those evolving over time, such as time series data. This approach provides a powerful lens through which to conceptualize a system’s behavior, distinguishing between its internal, unobservable dynamics and its external, measurable manifestations. The framework is especially adept at managing multivariate data and intricate dynamic relationships, providing a systematic method for modeling processes under uncertainty.

1.1. General Concept of State-Space Modeling

At its core, state-space modeling posits that the true condition of a system at any given time can be encapsulated by a set of unobserved, or “hidden,” state variables. These hidden states are not directly measurable but dictate the system’s evolution and its observable outputs. The fundamental concept involves separating the true, underlying system state—which is often latent or difficult to measure without error—from the noisy, incomplete observations that are actually collected. This separation is a crucial aspect, as it allows for the robust estimation of the true state even when measurements are imperfect or corrupted by noise.

For instance, consider the true position and velocity of an aircraft. While these are the actual states of interest, they cannot be measured perfectly due to sensor limitations and atmospheric disturbances. What is observed are noisy radar readings or GPS signals. State-space models address this challenge by introducing a hidden state variable that represents the aircraft’s true, uncorrupted position and velocity. By modeling how this hidden state evolves and how it relates to the noisy observations, it becomes possible to infer the unobservable truth. This capability to estimate the underlying reality from imperfect data is a significant advantage, addressing a common challenge in real-world data analysis where noise and incompleteness are prevalent.

1.2. The Hidden Markov State Process

A defining characteristic of state-space models is the assumption of a hidden Markov state process for the evolution of the state vector (xₜ). This means that, given the state at the previous time step t−1 and any exogenous inputs, the state at time t is conditionally independent of all earlier states and observations. In essence, once the current state is known, all prior states and observations become irrelevant for predicting the future state.

This Markovian assumption, while appearing restrictive, is foundational for the computational efficiency of state estimation algorithms, such as the Kalman filter. Many complex time series problems involve dependencies on a long history of past states and inputs, which can lead to high-dimensional and computationally intractable challenges. By assuming that the future state depends only on the immediate past state, this “memoryless” property (conditional on the previous state) allows for recursive algorithms. These algorithms update the state estimate sequentially, requiring only the previous state estimate and the current observation. Without this pragmatic simplification, exact inference would often be computationally prohibitive for many real-world applications, significantly limiting the practical utility of the state-space framework.

1.3. Observations Independent Given the States

Another core principle of state-space models is that the observations (yₜ) at time t are conditionally independent of all other observations and past states, given the current state xₜ.

This implies that once the true state at time t is known, the observation yₜ provides no additional information about past states or future observations beyond what is already captured by xₜ.

This assumption effectively separates the measurement process from the system’s internal dynamics. The noise inherent in the observation equation is considered to be purely measurement error, uncorrelated with the system’s internal evolution. This separation is crucial for distinguishing between true system changes and inaccuracies arising from the measurement process. The observation at time t is modeled as solely a function of the true state at time t, perturbed by independent measurement noise. It does not carry information about how the state arrived at time t, nor does it directly influence how the state will evolve from time t (that role is reserved for the state equation). This clear decomposition of uncertainty—one part from the system’s evolution (process noise) and another from the measurement process (observation noise)—is vital for accurate state estimation and prediction. It simplifies the likelihood function and facilitates optimal filtering algorithms where the measurement update step precisely accounts for new information from the observation without conflating it with dynamic uncertainties.

2. The Linear Gaussian State Model: Core Formulation

The Linear Gaussian State Model (LGSM) is a specific type of state-space model characterized by linear relationships between states and observations, and by Gaussian (normal) distributions for its noise terms. This combination of linearity and Gaussianity is what makes the model analytically tractable and allows for the development of optimal estimation algorithms like the Kalman filter.

2.1. Fundamental Equations

The LGSM is defined by two fundamental linear equations: the state equation, which describes the evolution of the hidden state, and the observation equation, which relates the hidden state to the observable measurements.

2.1.1. The State Equation (Process Equation)

The state equation describes how the hidden state vector xₜ evolves over time. It is a first-order linear difference equation:

xₜ = Fₜxₜ₋₁ + Bₜuₜ + νₜ

This equation indicates that the state at time t is a linear transformation of the state at time t−1, possibly influenced by exogenous inputs (uₜ), and perturbed by a stochastic term known as process noise (νₜ).

2.1.2. The Observation Equation (Measurement Equation)

The observation equation links the observed data vector yₜ to the hidden state vector xₜ at the same time point:

yₜ = Hₜxₜ + Dₜuₜ + εₜ

This equation shows that the observed data is a linear transformation of the current state, potentially influenced by exogenous inputs (uₜ), and corrupted by observation noise (εₜ).
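To make the two equations concrete, here is a minimal simulation sketch for the scalar case (m = p = q = 1), using only the Python standard library. The function name `simulate_lgsm`, the default parameter values, and the fixed seed are illustrative assumptions, not part of the model definition.

```python
import random

def simulate_lgsm(n_steps, F=0.9, H=1.0, B=0.0, D=0.0, Q=0.1, R=0.5,
                  x0=0.0, inputs=None, seed=0):
    """Simulate a scalar linear Gaussian state model.

    State equation:       x_t = F x_{t-1} + B u_t + v_t,  v_t ~ N(0, Q)
    Observation equation: y_t = H x_t     + D u_t + e_t,  e_t ~ N(0, R)
    """
    rng = random.Random(seed)
    if inputs is None:
        inputs = [0.0] * n_steps          # no exogenous input by default
    x_prev = x0
    states, observations = [], []
    for t in range(n_steps):
        u = inputs[t]
        v = rng.gauss(0.0, Q ** 0.5)      # process noise (std = sqrt(Q))
        e = rng.gauss(0.0, R ** 0.5)      # observation noise (std = sqrt(R))
        x = F * x_prev + B * u + v        # hidden state evolves
        y = H * x + D * u + e             # noisy measurement of the state
        states.append(x)
        observations.append(y)
        x_prev = x
    return states, observations

states, observations = simulate_lgsm(5)
```

Only `observations` would be available in practice; `states` is returned here so the hidden path can be compared with whatever an estimator recovers.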

2.2. Detailed Explanation of Components

Understanding each component of these equations is critical for grasping the LGSM’s functionality.

State Vector (xₜ): This is an m×1 vector representing the unobserved (hidden) state of the system at time t. It encapsulates the essential information about the system’s internal dynamics. For example, in a tracking application, xₜ might include the position and velocity of an object. In an economic model, it could represent underlying economic trends or latent factors that drive observable indicators. The state vector is the core unobservable variable that evolves according to the system’s internal dynamics.
Observed Data Vector (yₜ): This is a p×1 vector representing the measurements collected at time t. These are the actual data points that are available, and they are typically noisy or incomplete representations of the true state. The observed data vector serves as the observable output of the system, from which the hidden state is inferred.
State Transition Matrix (Fₜ): An m×m matrix that defines how the state vector evolves from time t−1 to t. It captures the system’s internal dynamics. If Fₜ is constant over time, it implies that the system’s dynamics are time-invariant. This matrix governs the deterministic evolution of the hidden state over time.
Measurement Matrix (Hₜ): A p×m matrix that maps the hidden state vector xₜ to the observed data vector yₜ. It describes how the unobserved state variables are linearly transformed into the measurements. This matrix defines the relationship between the unobserved state and the observed data.
State Noise (νₜ): An m×1 vector representing the process noise or system noise. This stochastic term accounts for unmodeled dynamics, disturbances, or uncertainties in the system’s evolution that are not captured by the deterministic part of the state equation. It captures the inherent stochasticity and inaccuracies in the state evolution model.
Observation Noise (εₜ): A p×1 vector representing the measurement noise. This term accounts for inaccuracies, errors, or disturbances in the observation process itself, such as sensor noise or environmental interference. It captures the stochasticity and inaccuracies in the measurement process.
Covariance Matrices (Qₜ, Rₜ): Qₜ is the m×m covariance matrix of the state noise νₜ, and Rₜ is the p×p covariance matrix of the observation noise εₜ. These matrices are crucial as they quantify the uncertainty and correlation within their respective noise terms. They define the statistical properties (variance and covariance) of the noise, which are essential for optimal state estimation algorithms.

The matrices (Fₜ, Hₜ) define the deterministic structure of the system, outlining how the state evolves and how observations are generated. In contrast, the noise terms (νₜ, εₜ) and their covariance matrices (Qₜ, Rₜ) introduce stochasticity and uncertainty into the model. The interplay between these deterministic and stochastic components is critical for accurate modeling and inference. For example, a small Qₜ implies a highly predictable state evolution, suggesting minimal unmodeled disturbances. Conversely, a large Rₜ indicates very noisy measurements, meaning the observations are less reliable. This balance directly impacts how much weight an estimation algorithm, such as the Kalman filter, assigns to the model’s prediction versus the incoming new observations. If process noise (Qₜ) is high, the model’s prediction is less trusted, and the algorithm will rely more heavily on the observation. If measurement noise (Rₜ) is high, the observation is less reliable, and the model will lean more on its own prediction. This dynamic weighting is a fundamental strength of the LGSM for robust estimation.
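The weighting described above can be seen directly in the scalar Kalman gain. The sketch below uses the standard scalar form of the gain, which this section does not derive; the function name `kalman_gain` and the example values are assumptions chosen for illustration.

```python
def kalman_gain(P_prev, Q, R, F=1.0, H=1.0):
    """Scalar Kalman gain: the weight given to the new observation.

    P_prev is the variance of the previous state estimate.
    """
    P_pred = F * P_prev * F + Q              # predicted state variance
    return P_pred * H / (H * P_pred * H + R)

baseline = kalman_gain(1.0, Q=1.0, R=1.0)
noisy_process = kalman_gain(1.0, Q=10.0, R=1.0)   # large Q: trust the observation more
noisy_sensor = kalman_gain(1.0, Q=1.0, R=10.0)    # large R: trust the prediction more
```

With H = 1 the gain lies between 0 and 1: it moves toward 1 as Q grows and toward 0 as R grows.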

To summarize the components of the Linear Gaussian State Model, the following table provides a concise overview:

Table 1: Components of the Linear Gaussian State Model Equations

Component Name	Symbol	Typical Dimensions	Description/Role
State Vector	xₜ	m×1	Unobserved (hidden) state of the system
Observed Data Vector	yₜ	p×1	Observed measurements of the system
State Transition Matrix	Fₜ	m×m	Governs the evolution of the hidden state over time
Input Matrix (State)	Bₜ	m×q	Maps exogenous inputs to the state equation
Measurement Matrix	Hₜ	p×m	Defines relationship between state and observations
Input Matrix (Observation)	Dₜ	p×q	Maps exogenous inputs to the observation equation
State Noise	νₜ	m×1	Unmodeled dynamics, disturbances in state evolution
Observation Noise	εₜ	p×1	Inaccuracies, errors in the measurement process
State Noise Covariance	Qₜ	m×m	Quantifies uncertainty in state noise
Observation Noise Cov.	Rₜ	p×p	Quantifies uncertainty in observation noise
Exogenous Input	uₜ	q×1	External factors influencing the system or observations

3. Distributional Assumptions and Properties

The “Gaussian” aspect of the Linear Gaussian State Model refers to critical statistical assumptions about the noise terms and the initial state. These assumptions are not arbitrary; they are the bedrock upon which the analytical tractability and optimality of associated algorithms, particularly the Kalman filter, are built.

3.1. Assumptions for Noise Terms

Both the state noise (νₜ) and observation noise (εₜ) are assumed to follow specific distributional properties:

Gaussianity: Both noise terms are assumed to be normally (Gaussian) distributed.
- νₜ ∼ N(0,Qₜ)
- εₜ ∼ N(0,Rₜ)
Zero-Mean: Both noise terms are assumed to have a mean of zero. This implies that the noise introduces no systematic bias: on average, the deterministic parts of the state and observation equations capture the central tendency of the system and observation processes.
White Noise: The noise terms are assumed to be serially uncorrelated over time. Because the noise is Gaussian, this means that νₜ is independent of νₛ for t≠s, and similarly for εₜ. This “memoryless” property for the noise simplifies the statistical analysis.

3.2. Assumptions for the Initial State (x₀)

The initial state x₀ is also assumed to be normally distributed: x₀ ∼ N(x̂₀,P₀)

Here, x̂₀ represents the initial mean estimate of the state, and P₀ is its initial covariance matrix, quantifying the uncertainty in this initial estimate.

3.3. Independence Assumptions

Crucial independence assumptions further simplify the model’s statistical properties:

Independence between Noise Terms: νₜ and εₛ are assumed to be independent for all t,s. This means that process noise, which affects the system’s internal dynamics, does not directly influence measurement noise, which affects the observation process.
Independence from Initial State: Both νₜ and εₜ are assumed to be independent of the initial state x₀. This ensures that the initial uncertainty and subsequent noise disturbances are distinct.

These Gaussian and independence assumptions are fundamental because they enable the analytical tractability and optimality of state estimation algorithms. Without them, the conditional distributions involved in state estimation generally have no closed form, and one must resort to approximations, such as particle filters for non-Gaussian systems or Extended Kalman Filters for non-linear systems. The mathematical benefit of Gaussianity is that sums of independent (more generally, jointly) Gaussian random variables are also Gaussian, and linear transformations of Gaussian variables remain Gaussian. This “closure property” ensures that if the initial state and noise are Gaussian, then all subsequent states and observations will also be Gaussian. The independence assumptions further let the joint distribution factor into a product of simple conditional distributions, p(xₜ | xₜ₋₁) and p(yₜ | xₜ), making calculations feasible.

Collectively, these assumptions allow the posterior distribution of the state to be exactly Gaussian at each time step. This means the distribution can be fully characterized by its mean and covariance, leading to the elegant and computationally efficient recursive updates of the Kalman filter. The Kalman filter, under these conditions, provides the Minimum Mean Squared Error (MMSE) estimate, which is the optimal estimate for linear Gaussian systems. Deviations from these assumptions, such as non-Gaussian noise or non-linear dynamics, necessitate more complex, often approximate, filtering techniques, significantly increasing computational burden and theoretical complexity.
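Because the posterior stays Gaussian, one filtering step only has to update a mean and a variance. The following is a minimal scalar sketch of the standard predict and update recursion; the function name `kalman_step` and the default parameter values are illustrative assumptions, and exogenous inputs are omitted for brevity.

```python
def kalman_step(mean, var, y, F=1.0, H=1.0, Q=0.1, R=0.5):
    """One predict/update cycle of a scalar Kalman filter.

    (mean, var) describe the Gaussian posterior of x_{t-1};
    the return value describes the Gaussian posterior of x_t given y_t.
    """
    # Predict: push the previous posterior through the state equation.
    mean_pred = F * mean
    var_pred = F * var * F + Q
    # Update: correct the prediction with the new observation.
    innovation = y - H * mean_pred            # observation minus its prediction
    innovation_var = H * var_pred * H + R
    gain = var_pred * H / innovation_var
    mean_post = mean_pred + gain * innovation
    var_post = (1.0 - gain * H) * var_pred
    return mean_post, var_post

mean, var = 0.0, 1.0                          # prior on the initial state
for y in [0.9, 1.1, 1.0]:
    mean, var = kalman_step(mean, var, y)
```

Running the loop over a series of observations gives the filtered estimate at each time step using only the previous estimate and the current observation.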

The following table summarizes the key assumptions underpinning the Linear Gaussian State Model:

Table 2: Key Assumptions of the Linear Gaussian State Model

Assumption Category	Specific Assumption	Rationale/Implication
Noise Distribution	νₜ ∼ N(0,Qₜ)	Enables analytical solutions like the Kalman filter; assumes unmodeled dynamics are random and centered around zero.
	εₜ ∼ N(0,Rₜ)	Assumes measurement errors are random, centered around zero, and normally distributed.
Initial State	x₀ ∼ N(x̂₀,P₀)	Provides a starting point for the recursive estimation process; initial uncertainty is quantified.
Independence	νₜ ⊥ εₛ for all t,s	Process noise and measurement noise are distinct and do not directly influence each other; simplifies likelihood calculations.
	νₜ ⊥ x₀ for all t	Process disturbances are independent of the initial system state.
	εₜ ⊥ x₀ for all t	Measurement errors are independent of the initial system state.
	νₜ and εₜ are white noise	Noise at one time point is uncorrelated with noise at other time points, simplifying temporal dependencies.

4. Extensions and Enhancements

The basic Linear Gaussian State Model can be extended to handle more complex and realistic scenarios, particularly through the inclusion of exogenous variables and the allowance for time-varying parameters. These enhancements significantly broaden the model’s applicability.

4.1. Incorporating Exogenous Variables (uₜ)

Exogenous variables, also known as control inputs or covariates, are external factors that influence the system but are not themselves part of the state vector. Their incorporation into both the state and observation equations allows the model to account for known external influences.

In the state equation, exogenous inputs are introduced via the Bₜ matrix: xₜ = Fₜxₜ₋₁ + Bₜuₜ + νₜ

Here, Bₜ is an m×q input matrix that maps the q×1 exogenous input vector uₜ to the state space. This allows external control actions (e.g., engine thrust in an aircraft) or known environmental factors (e.g., temperature affecting a chemical process) to directly influence the system’s evolution.

Similarly, in the observation equation, exogenous inputs can be included via the Dₜ matrix: yₜ = Hₜxₜ + Dₜuₜ + εₜ

Here, Dₜ is a p×q feedforward matrix that allows exogenous inputs to directly affect the observations. This is useful when the measurement itself is influenced by external factors independently of their influence on the state. For example, a sensor reading might be directly affected by ambient light conditions (uₜ) in addition to the true state it’s trying to measure.

The ability to incorporate exogenous variables significantly enhances the model’s realism and practical utility, especially in control systems, econometrics, or any domain where external, measurable interventions or influences exist. It transforms the model from a purely descriptive tool into a potentially predictive and prescriptive one. By explicitly accounting for known influences, the model’s predictions become more accurate, as less variance needs to be attributed solely to the noise terms. In control engineering, uₜ represents control signals, allowing the model to be used for designing optimal control strategies. In econometrics, uₜ could be policy variables, enabling “what-if” analyses of different policy interventions. This extension allows for modeling direct causal effects of external factors on both the system’s true state and its observations, making the LGSM a powerful tool for fields requiring active management or prediction under varying external conditions.
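As a rough illustration of where the inputs enter, the sketch below shows a scalar prediction step with Bₜuₜ and the expected observation with Dₜuₜ. The function names and the numeric values are hypothetical; the point is that a known input shifts the predicted mean but adds no uncertainty.

```python
def predict_with_input(mean, var, u, F=1.0, B=0.5, Q=0.1):
    """Scalar prediction step including an exogenous input u."""
    mean_pred = F * mean + B * u      # known input shifts the mean
    var_pred = F * var * F + Q        # variance is unaffected by u
    return mean_pred, var_pred

def expected_observation(mean_pred, u, H=1.0, D=0.0):
    """Expected measurement given the predicted state and the input."""
    return H * mean_pred + D * u

mean_pred, var_pred = predict_with_input(1.0, 1.0, u=2.0, F=1.0, B=0.5, Q=0.0)
y_expected = expected_observation(mean_pred, u=2.0, H=1.0, D=0.25)
```

In a full filter, the innovation would then be computed as the actual observation minus `y_expected`, so the direct effect of the input on the sensor is not mistaken for a change in the state.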

4.2. Discussion of Time-Varying Parameters

The notation in the core equations, where matrices like Fₜ,Bₜ,Hₜ,Dₜ and covariance matrices Qₜ,Rₜ are explicitly subscripted with t, signifies that these parameters can change over time. This flexibility allows the model to adapt to evolving system dynamics or measurement characteristics.

Time-varying parameters enable the LGSM to model systems whose underlying dynamics or measurement properties are not static. For instance, a missile’s mass decreases as fuel burns, which changes its dynamic response; an economy’s responsiveness to policy changes might evolve over decades; or a sensor’s precision could degrade over time. If these changes are known or can be estimated, the Kalman filter and other LGSM-based algorithms can still be applied effectively. This adaptability is crucial for modeling long-term or non-stationary processes. It allows the model to capture non-stationarities, regime shifts, or known changes in the system’s physical properties or measurement equipment. This flexibility greatly expands the range of phenomena that can be accurately modeled by the LGSM, moving beyond simple stationary processes to complex, evolving systems.
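A sketch of how time-varying parameters fit into the same recursion, in the scalar case: the filter simply reads Fₜ, Qₜ, and Rₜ for each step. The function name `filter_time_varying` and the degrading-sensor example values are assumptions made for illustration.

```python
def filter_time_varying(observations, F_seq, Q_seq, R_seq,
                        H=1.0, mean0=0.0, var0=1.0):
    """Scalar Kalman filter with per-step F_t, Q_t and R_t."""
    mean, var = mean0, var0
    gains = []
    for y, F, Q, R in zip(observations, F_seq, Q_seq, R_seq):
        mean_pred = F * mean
        var_pred = F * var * F + Q
        gain = var_pred * H / (H * var_pred * H + R)
        mean = mean_pred + gain * (y - H * mean_pred)
        var = (1.0 - gain * H) * var_pred
        gains.append(gain)
    return mean, var, gains

# A sensor whose noise variance grows over time (hypothetical values).
obs = [1.0, 1.0, 1.0]
mean, var, gains = filter_time_varying(
    obs, F_seq=[1.0, 1.0, 1.0], Q_seq=[0.1, 0.1, 0.1], R_seq=[0.1, 1.0, 10.0])
```

As the sensor degrades, the gain shrinks and the filter leans increasingly on its own prediction.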

5. Advantages and Applications

The Linear Gaussian State Model, along with its underlying state-space framework, offers numerous practical benefits that contribute to its widespread adoption across diverse scientific and engineering disciplines.

5.1. Flexibility for Modeling Complex Multivariate Time Series

The state-space formulation naturally handles multiple interacting time series variables by representing them within a single state vector and observation vector. This capability allows for the modeling of complex interdependencies and dynamic relationships among multiple series.

The state-space representation provides a unified and systematic way to manage the inherent complexity of real-world multivariate time series. Instead of developing separate, ad-hoc models for each series, it integrates them into a single, coherent dynamic system. For example, in financial markets, stock prices, interest rates, and inflation are interrelated. In a complex engineering system, multiple sensor readings from different points are interconnected. The LGSM allows for simultaneous estimation and prediction of all these series, capturing lead-lag relationships, common trends, and dynamic dependencies that are often difficult to model with univariate methods. It also offers the potential for dimensionality reduction if the underlying state that drives the observations is lower-dimensional than the observed data itself. This makes the LGSM invaluable for fields like econometrics, signal processing, and control engineering, where systems are inherently multivariate and interconnected.

5.2. Handling Missing Data

A significant practical advantage of the state-space framework is its inherent ability to handle missing observations seamlessly. Many traditional time series analysis methods struggle with or require explicit imputation for missing data, which can introduce bias or complexity into the analysis.

In contrast, the state estimation process within the LGSM, particularly with the Kalman filter, is recursive. If an observation is missing at a particular time step, the filter simply relies on its prediction step (based on the estimated state from the previous time step) to estimate the current state, without performing an observation update. The uncertainty, as quantified by the state covariance matrix, will naturally increase to reflect the lack of new information. This means no explicit imputation is needed; the model naturally propagates uncertainty and provides the best possible estimate given the available data, avoiding the complexities and potential biases of imputation methods. This robustness to incomplete data makes the LGSM highly applicable in scenarios where data collection is intermittent or unreliable, such as in sensor networks, medical monitoring, or financial data analysis where data dropouts are common.
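The missing-data behaviour described above can be sketched in a few lines for the scalar case. Representing a missing observation as `None` is an assumption of this example, as are the function name and default parameters.

```python
def filter_with_missing(observations, F=1.0, H=1.0, Q=0.1, R=0.5,
                        mean0=0.0, var0=1.0):
    """Scalar Kalman filter that skips the update when an observation is None."""
    mean, var = mean0, var0
    history = []
    for y in observations:
        # Prediction step always runs.
        mean, var = F * mean, F * var * F + Q
        # Update step runs only when a measurement is available.
        if y is not None:
            gain = var * H / (H * var * H + R)
            mean = mean + gain * (y - H * mean)
            var = (1.0 - gain * H) * var
        history.append((mean, var))
    return history

history = filter_with_missing([1.0, None, None, 1.0])
variances = [v for _, v in history]
```

In this run the variance grows by Q at each missing step and drops again once a measurement arrives, with no imputation involved.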

5.3. Representing Various Structures

Many classical time series models, such as Autoregressive Moving Average (ARMA) models, can be reformulated and represented within the state-space framework. This demonstrates the state-space model’s generality and its role as a unifying framework in time series analysis. For instance, a higher-order Autoregressive (AR(p)) model can be transformed into a first-order state-space representation by defining the state vector to include lagged values of the series.

This capability highlights that many seemingly disparate time series models are, in fact, special cases or alternative parameterizations within the broader state-space structure. This provides a common ground for analysis, estimation, and forecasting. The unification means that a single set of powerful algorithms, such as the Kalman filter for estimation and the Kalman smoother for smoothing, can be applied to a wide variety of models. This simplifies the methodological toolkit and allows for easier combination of different model components. The LGSM is thus not just another model, but a meta-framework that encompasses and generalizes many established time series techniques, making it a cornerstone of modern time series analysis.
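As an illustration of the AR(p) case, the sketch below builds the usual companion-form matrices, with the state holding the current value and its p−1 lags. The helper names `ar_to_state_space` and `mat_vec` are hypothetical, and plain lists are used in place of a matrix library.

```python
def ar_to_state_space(phi):
    """Companion-form F and H for an AR(p) process.

    phi[0..p-1] are the AR coefficients; the state is
    x_t = [z_t, z_{t-1}, ..., z_{t-p+1}].
    """
    p = len(phi)
    # First row applies the AR coefficients; the remaining rows shift lags down.
    F = [list(phi)] + [[1.0 if j == i else 0.0 for j in range(p)]
                       for i in range(p - 1)]
    H = [1.0] + [0.0] * (p - 1)       # observe only the current value z_t
    return F, H

def mat_vec(M, v):
    """Matrix-vector product for nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

F, H = ar_to_state_space([0.5, 0.3])  # AR(2): z_t = 0.5 z_{t-1} + 0.3 z_{t-2} + w_t
next_state_mean = mat_vec(F, [2.0, 1.0])   # previous state [z_{t-1}, z_{t-2}]
```

The first entry of `next_state_mean` is the AR(2) one-step prediction, and the second entry is just the previous value carried forward as a lag.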

5.4. Practical Applications

The versatility of the LGSM in modeling dynamic systems under uncertainty has led to its widespread application across numerous scientific and engineering disciplines. The following common application domains illustrate its practical utility.

Signal Processing & Control Systems: One of the most prominent applications is in tracking objects, such as aircraft, missiles, or vehicles, using noisy sensor data from sources like GPS or radar. The Kalman filter, derived from the LGSM, is fundamental for navigation, guidance, and control systems, providing robust state estimates even with significant measurement noise.
Econometrics & Finance: In these fields, the LGSM is used for modeling complex economic indicators (e.g., GDP, inflation, interest rates), forecasting stock prices, modeling volatility, and estimating unobserved economic components like permanent income or business cycles. Its ability to handle latent variables and missing data is particularly valuable here.
Biomedical Engineering: The model is applied to monitor physiological signals (e.g., ECG, EEG) to estimate underlying physiological states, model drug pharmacokinetics, and track patient vital signs. It helps in extracting meaningful information from noisy biological data.
Environmental Science: LGSMs are used to model climate dynamics, pollutant dispersion, and ecological population dynamics, often leveraging data from noisy sensor networks.
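Robotics: In robotics, the LGSM underpins tasks like Simultaneous Localization and Mapping (SLAM), robot navigation, and sensor fusion, where a robot needs to estimate its own state and map its environment simultaneously using imperfect sensor readings. Because robot motion and sensor models are usually non-linear, these tasks typically rely on linearized variants such as the Extended Kalman Filter.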

The following table summarizes common applications of the Linear Gaussian State Model:

Table 3: Common Applications of the Linear Gaussian State Model

Application Domain	Specific Use Case	Benefit of LGSM
Signal Processing & Control Systems	Object Tracking (e.g., aircraft, vehicles)	Robust state estimation in noisy environments, real-time prediction for guidance and control.
Econometrics & Finance	Economic Forecasting, Volatility Modeling	Handles unobserved economic factors, missing data, and complex interdependencies in financial time series.
Biomedical Engineering	Physiological Monitoring, Drug Pharmacokinetics	Infers latent biological states from noisy measurements, tracks dynamic physiological processes.
Environmental Science	Climate Modeling, Pollutant Tracking	Models complex environmental dynamics, handles noisy sensor data from distributed networks.
Robotics	Simultaneous Localization and Mapping (SLAM)	Fuses multiple noisy sensor inputs to estimate robot pose and map environment accurately.

6. Relationship to Other Models

The Linear Gaussian State Model is not an isolated framework but is deeply interconnected with other established time series models, notably ARMA models. This relationship underscores the LGSM’s generality and its role as a unifying paradigm in time series analysis.

6.1. Connection between Stationary ARMA Models and Stationary State-Space Models

A significant theoretical result in time series analysis is that any stationary ARMA (Autoregressive Moving Average) model can be represented in a state-space form. Conversely, under certain conditions, a stationary state-space model can be reduced to an ARMA representation. This equivalence demonstrates the broad applicability and theoretical depth of the state-space framework.

Consider a simple example: an AR(1) process with observational noise.

An AR(1) process is typically defined as: zₜ = ϕzₜ₋₁ + wₜ, where wₜ is white noise.

If we observe yₜ = zₜ + εₜ, where εₜ is observation noise, this system can be perfectly cast into the LGSM form:

State Equation: xₜ = ϕxₜ₋₁ + νₜ. Here, the hidden state xₜ is simply zₜ, the true underlying AR(1) process. The state transition matrix Fₜ is the scalar ϕ. The process noise νₜ is wₜ.

Observation Equation: yₜ = 1·xₜ + εₜ. The measurement matrix Hₜ is the scalar 1, indicating a direct observation of the state.

For higher-order ARMA models, the state vector would be augmented to include lagged values of the series and/or past noise terms to achieve the first-order Markovian structure required by the state-space representation.

The equivalence between ARMA models and state-space models is more than a theoretical curiosity; it provides a practical computational advantage. It means that the robust and efficient algorithms developed specifically for state-space models, such as the Kalman filter and Kalman smoother, can be directly applied to estimate and forecast ARMA processes. These algorithms are often more flexible than traditional ARMA estimation methods, particularly in handling issues like missing data. The Kalman filter does not remove the need for numerical optimization in maximum likelihood estimation, but it makes each likelihood evaluation exact and cheap: a single recursive pass over the data yields the Gaussian likelihood through the one-step prediction errors, including for models with moving average components. This unification means that a single, well-understood algorithmic framework can address a wide variety of classical time series problems, which is why the state-space model is best viewed not merely as an alternative model but as a general framework that encompasses many established time series techniques.
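For the AR(1)-plus-noise example above, the Gaussian log-likelihood of a candidate ϕ can be accumulated from the filter's one-step prediction errors. This is a minimal sketch under the scalar setup with Hₜ = 1; the function name `ar1_noise_loglik`, the initial-state defaults, and the sample data are assumptions for illustration.

```python
import math

def ar1_noise_loglik(ys, phi, Q, R, mean0=0.0, var0=1.0):
    """Gaussian log-likelihood of y_t = x_t + e_t with x_t = phi x_{t-1} + v_t."""
    mean, var = mean0, var0
    loglik = 0.0
    for y in ys:
        mean_pred = phi * mean
        var_pred = phi * var * phi + Q
        innovation = y - mean_pred               # one-step prediction error
        innovation_var = var_pred + R            # its variance (H = 1)
        loglik += -0.5 * (math.log(2.0 * math.pi * innovation_var)
                          + innovation ** 2 / innovation_var)
        gain = var_pred / innovation_var
        mean = mean_pred + gain * innovation
        var = (1.0 - gain) * var_pred
    return loglik

ys = [0.5, 0.4, 0.45, 0.3, 0.35]                 # made-up observations
scores = {phi: ar1_noise_loglik(ys, phi, Q=0.1, R=0.1) for phi in (0.2, 0.5, 0.9)}
```

Candidate parameter values can then be compared by their scores, or the function can be handed to a numerical optimizer.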

7. Conclusion

The Linear Gaussian State Model stands as a cornerstone in the field of dynamic system modeling, offering a powerful and flexible framework for understanding and predicting the behavior of systems over time, especially in the presence of uncertainty.

7.1. Summary of Key Concepts

At its core, the LGSM is built upon two fundamental principles: a hidden Markov state process, where the current state encapsulates all necessary information from the past to predict the future, and observations that are conditionally independent given the current state. These principles allow for a clear separation between the unobservable true system dynamics and the noisy, incomplete measurements. The model is mathematically defined by two linear equations—the state equation describing state evolution and the observation equation linking states to measurements—each comprising specific components like state vectors, observation vectors, transition and measurement matrices, and their respective noise terms. The “Gaussian” aspect of the model refers to the critical assumptions that these noise terms and the initial state are normally distributed and mutually independent, which are crucial for the analytical tractability and optimality of associated estimation algorithms like the Kalman filter.

7.2. The Model’s Utility and Versatility

The utility of the LGSM is profound, stemming from its inherent strengths in handling complex multivariate time series, its natural ability to manage missing observations without requiring explicit imputation, and its capacity to represent a wide array of dynamic system structures, including many classical time series models like ARMA processes. Its flexibility allows it to adapt to time-varying system dynamics and measurement characteristics, further extending its applicability to non-stationary real-world phenomena.

The LGSM serves as a fundamental tool across numerous quantitative disciplines, from engineering and signal processing to econometrics, finance, and biomedical research. It enables robust state estimation, accurate prediction, and effective control in environments characterized by inherent uncertainty and noise. Beyond its direct applications, the Linear Gaussian State Model also serves as a foundational building block for more complex and realistic models. Understanding the LGSM is a prerequisite for tackling advanced topics such as non-linear state-space models (handled by, e.g., the Extended Kalman Filter and the Unscented Kalman Filter) and non-Gaussian state-space models (handled by, e.g., Particle Filters). The algorithms derived for the LGSM provide the conceptual and algorithmic intuition necessary to develop approximate solutions for these more challenging scenarios. Therefore, mastering the LGSM is not an end in itself but a crucial first step, providing the essential framework for modeling the full spectrum of dynamic systems under uncertainty.

References

My demo of Gaussian Processes

How to review code in data science

2025-07-31T00:00:00+00:00

Code review in software engineering is an essential part of the development process. It helps ensure code quality, facilitates knowledge sharing, and promotes best practices among team members.

In my experience, data science requires a different approach to code review than software engineering, given its exploratory nature, its reliance on data, and the importance of the underlying mathematics.

As a reviewer and contributor, I have found that the following aspects are crucial when reviewing code in data science:

Understand the Problem Domain: Before diving into the code, ensure you have a clear understanding of the problem being solved. This includes the business context, data sources, and expected outcomes. A solid grasp of the domain helps you evaluate whether the code effectively addresses the problem.
Focus on Data Handling: Data is at the core of data science. Review how the code handles data loading, cleaning, and preprocessing. Ensure that the data is appropriately transformed and that any assumptions made about the data are clearly documented.
- Check for potential data leakage, missing values, and outliers.
- Ensure that the data is being used ethically and responsibly. It is tempting to use production data for testing, but this can a) slow down the process and b) overload the system. Instead, use a subset of the data or synthetic data for testing purposes.
Clarify the Code’s Purpose: Data science code often involves complex algorithms and mathematical operations. Ensure that the code is well-commented and that the purpose of each function or block of code is clear, especially for algorithmic and mathematical components. This helps others understand the logic and reasoning behind the implementation.
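The data-handling checks in the list above can often be backed by small, reviewable helpers. The sketch below is one possible shape for two of them, missing values and train/test overlap; the function names, the record layout (a list of dicts with an `id` key), and the synthetic rows are all hypothetical.

```python
def count_missing(rows, columns):
    """Count None values per column in a list of dict records."""
    return {col: sum(1 for row in rows if row.get(col) is None)
            for col in columns}

def overlapping_keys(train_rows, test_rows, key="id"):
    """Return keys present in both splits, a common sign of data leakage."""
    return {row[key] for row in train_rows} & {row[key] for row in test_rows}

# Small synthetic records, so the check itself never touches production data.
train = [{"id": 1, "x": 0.2}, {"id": 2, "x": None}, {"id": 3, "x": 0.7}]
test = [{"id": 3, "x": 0.7}, {"id": 4, "x": 0.1}]

missing = count_missing(train, ["id", "x"])
leaked = overlapping_keys(train, test)
```

A reviewer can ask for checks like these to run on synthetic data as part of the change, which keeps the assumptions about the data explicit and testable.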

Unlike in software engineering, data science code often revises the same functionality repeatedly as algorithms and models evolve. Moreover, data drift can force updates to the existing code base. Therefore, it is essential to document the changes made to the code, including the rationale behind them.
