| Model | Task | Model Output | KPI | Target Value for Seed | Current Status | Comment |
|---|---|---|---|---|---|---|
| iGN | AI early disease warning | Alert | Sensitivity/False negative rate | 0.51 | | |
| iGN | Explicit case definitions | Alert | Specificity | 0.60 | | |
| iGN | Individual disease prediction | Outbreak Probability | Average precision for historical anthrax outbreaks in Victoria | 0.65 | 0.69 | |
| iGN | Trader regression | Data feed | RMSE between predicted and actual futures prices | 0.65 | 0.66 | |

During our Seed Stage, iGN (ingenum Graph Network) will be primarily based on data from WorkMate. Using iLM (ingenum Language Model), clinically relevant data points will be extracted from clinical notes and invoices. iGN will probabilistically map these to diagnoses.
Our primary ForeSight customer during the Seed Stage will be the financial services industry, for the purpose of futures trading. To provide useful outputs to those customers, iGN requires a minimum sample of data to train a useful model. However, the absolute volume of data matters less than its coverage. The penetration of the WorkMate product among vets is a strong proxy for the quality of the sample.
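For simplicity, assume that we model the impact of animal health on commodities futures as a single variable. Two important metrics for the purposes of futures trading are:
1. The estimated distribution of that variable, i.e. its fitted distribution parameters.
2. The tail risk of that variable.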
Futures traders will combine these metrics with other data in order to make their trades. The absolute quantity of data collected will improve the quality of the first metric up to a point of diminishing returns, but the more impactful consideration will be whether the sample is statistically representative of the population. This has both geographic and animal density dimensions.
Because it is more sensitive to sample size, the quality of the tail risk metric will continue to improve with WorkMate penetration.
We measure coverage, and by extension the quality of metric 1, by means of the Geographically Weighted Gini Coefficient (GWGC). A lower GWGC indicates that data is spread across regions more nearly in proportion to population.
$$ \text{GWGC} = 1 - \sum_{i=1}^{n} \left[(P_i - P_{i-1}) \cdot (D_i + D_{i-1})\right] $$
$$ \begin{aligned}i &= 1, 2, \ldots, n \text{ (where $n$ is the total number of regions)} \\P_i &= \text{Cumulative Proportion of Population up to region $i$} \\D_i &= \text{Cumulative Proportion of Data up to region $i$} \\P_0 &= D_0 = 0\end{aligned} $$
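As a minimal sketch of the GWGC formula above, the following Python function accumulates the cumulative proportions $P_i$ and $D_i$ region by region. The function name `gwgc`, its list-based inputs, and the choice to order regions by data per head of population (the usual Lorenz-curve ordering) are assumptions made for illustration; the definition above does not state an ordering.

```python
def gwgc(population, data):
    """Geographically Weighted Gini Coefficient for per-region counts.

    population[i] and data[i] are the population and data volume of region i.
    Assumption: regions are ordered by data per head, ascending, before
    accumulating, which keeps the coefficient non-negative.
    """
    if len(population) != len(data) or not population:
        raise ValueError("population and data must be non-empty and equal length")
    if any(p <= 0 for p in population) or any(d < 0 for d in data):
        raise ValueError("population must be positive and data non-negative")

    total_pop = sum(population)
    total_data = sum(data)
    if total_data == 0:
        raise ValueError("total data volume must be positive")

    regions = sorted(zip(population, data), key=lambda r: r[1] / r[0])

    cum_pop = cum_data = 0.0
    p_prev = d_prev = 0.0  # P_0 = D_0 = 0
    total = 0.0
    for pop, dat in regions:
        cum_pop += pop
        cum_data += dat
        p_i = cum_pop / total_pop
        d_i = cum_data / total_data
        total += (p_i - p_prev) * (d_i + d_prev)
        p_prev, d_prev = p_i, d_i
    return 1.0 - total


# Data spread in proportion to population gives a coefficient of (almost) zero.
print(gwgc([100, 200, 300], [10, 20, 30]))
# All data concentrated in one of two equally sized regions.
print(gwgc([1, 1], [0, 1]))
```

When data is exactly proportional to population, $D_i = P_i$ for every region, the sum telescopes to 1, and the coefficient is 0.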
To quantify the impact of increasing the sample size ($n$) and decreasing the Geographically Weighted Gini Coefficient (GWGC) on the quality of the maximum likelihood estimate (MLE) of the Weibull distribution parameters, we need to consider the asymptotic properties of the MLE and the relationship between the GWGC and the sample's representativeness. From here on, $n$ denotes the sample size rather than the number of regions used in the GWGC definition.
Let's denote the true Weibull distribution parameters as $\theta = (\alpha, \beta)$, where $\alpha$ is the shape parameter and $\beta$ is the scale parameter. The MLE of $\theta$ based on a sample of size $n$ is denoted as $\hat{\theta} = (\hat{\alpha}, \hat{\beta})$.
Combining the effects of sample size and GWGC, we can express the impact on the quality of the MLE using the following formula:
$$ \sqrt{n(1 - G)}\,(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}\left(0, I(\theta)^{-1}\right) $$
This formula indicates that as the sample size $n$ increases and the GWGC ($G$) decreases, the quality of the MLE improves. The term $n(1 - G)$ acts as an effective sample size, discounting the actual sample size by the representativeness factor $(1 - G)$. The matrix $I(\theta)^{-1}$ itself does not change; rather, the approximate covariance of $\hat{\theta}$ is $I(\theta)^{-1} / (n(1 - G))$, which shrinks as the effective sample size grows. Standard errors therefore fall in proportion to $1/\sqrt{n(1 - G)}$, indicating more precise estimates.
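To make the scaling concrete, the sketch below computes the effective sample size $n(1 - G)$ implied by the formula above, and the factor $1/\sqrt{n(1 - G)}$ by which standard errors scale. The function names are hypothetical and the input values are illustrative only, not measured figures.

```python
import math


def effective_sample_size(n, g):
    """Effective sample size n * (1 - G) implied by the formula above."""
    if n <= 0 or not 0.0 <= g < 1.0:
        raise ValueError("require n > 0 and 0 <= G < 1")
    return n * (1.0 - g)


def standard_error_scale(n, g):
    """Factor applied to sqrt(diag(I(theta)^-1)) to approximate standard errors."""
    return 1.0 / math.sqrt(effective_sample_size(n, g))


# Illustrative values: the same raw sample size under poor and perfect coverage.
print(effective_sample_size(400, 0.75), standard_error_scale(400, 0.75))
print(effective_sample_size(400, 0.0), standard_error_scale(400, 0.0))
```

In this illustration, a sample of 400 with $G = 0.75$ carries the same information as a perfectly representative sample of 100, so its standard errors are twice as large as those of a fully representative sample of 400.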
To analyze the tail risk in the context of the Weibull distribution and its relationship with the sample size ($n$) and the Geographically Weighted Gini Coefficient (GWGC), we need to define a measure of tail risk. One common measure of tail risk is the Value-at-Risk (VaR) at a given confidence level.
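For the Weibull distribution with shape parameter $\alpha$ and scale parameter $\beta$, the cumulative distribution function is $F(x) = 1 - e^{-(x/\beta)^{\alpha}}$. The VaR $V$ at a confidence level of $p$ is the value at which $F(V) = p$, which gives:
$$ V(p) = \beta\,(-\ln(1 - p))^{1/\alpha} $$
where $0 < p < 1$, so that $1 - p$ is the probability of exceeding the VaR threshold.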
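The VaR formula above can be transcribed directly. The sketch below uses hypothetical function names and illustrative parameter values, and includes the Weibull cumulative distribution function so that the result can be checked: as written, $V(p)$ is the value at which the CDF equals $p$.

```python
import math


def weibull_var(p, shape, scale):
    """V(p) = scale * (-ln(1 - p)) ** (1 / shape), as in the formula above."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def weibull_cdf(x, shape, scale):
    """Weibull cumulative distribution function for x >= 0."""
    return 1.0 - math.exp(-((x / scale) ** shape))


# Illustrative parameters only.
v = weibull_var(0.99, shape=1.5, scale=2.0)
print(v, weibull_cdf(v, shape=1.5, scale=2.0))
```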
Now, let's consider the impact of increasing the sample size ($n$) and decreasing the GWGC on the estimation of the VaR.
To quantify the impact of increasing n and decreasing GWGC on the tail risk estimation, we can combine the effects on the parameter estimates and the VaR calculation:
$$ V(p, n, G) = \hat{\beta}(n, G)\,(-\ln(1 - p))^{1/\hat{\alpha}(n, G)} $$
where $\hat{\beta}(n, G)$ and $\hat{\alpha}(n, G)$ are the MLEs of the Weibull parameters based on a sample of size $n$ and GWGC $G$.
As the sample size $n$ increases and the GWGC $G$ decreases, the estimates $\hat{\beta}(n, G)$ and $\hat{\alpha}(n, G)$ become more accurate and precise. Consequently, the estimated $V(p, n, G)$ becomes a more reliable measure of the tail risk.
The rate of improvement in the tail risk estimation can be assessed by comparing the estimated VaR values for different combinations of $n$ and $G$. For example, one can calculate $V(p, n_1, G_1)$ and $V(p, n_2, G_2)$ for two scenarios with sample sizes $n_1$ and $n_2$ and GWGC values $G_1$ and $G_2$, respectively. The relative difference between the two VaR estimates indicates the impact of changing $n$ and $G$ on the tail risk estimation.
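The comparison described above can be sketched with a small simulation. Everything here is an illustration under stated assumptions rather than the production method: the function names are hypothetical, the "true" parameters are made up, and the effect of $G$ is modelled crudely by drawing only $n(1 - G)$ samples, following the effective sample size reading of the asymptotic formula. The shape parameter is found by bisection on the standard Weibull profile-likelihood equation, using only the Python standard library.

```python
import math
import random


def weibull_mle(samples, tol=1e-10):
    """Return (shape, scale) maximising the two-parameter Weibull likelihood."""
    logs = [math.log(x) for x in samples]  # requires every sample > 0
    if max(logs) == min(logs):
        raise ValueError("need at least two distinct sample values")
    mean_log = sum(logs) / len(logs)
    max_log = max(logs)

    def score(shape):
        # Profile-likelihood equation for the shape; its root is the MLE.
        # Weights are rescaled by the largest observation to avoid overflow.
        weights = [math.exp(shape * (lx - max_log)) for lx in logs]
        weighted = sum(w * lx for w, lx in zip(weights, logs)) / sum(weights)
        return weighted - 1.0 / shape - mean_log

    lo, hi = 1e-6, 1.0
    while score(hi) < 0:  # score increases with shape, so expand the bracket
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = (sum(x ** shape for x in samples) / len(samples)) ** (1.0 / shape)
    return shape, scale


def weibull_var(p, shape, scale):
    """V(p) = scale * (-ln(1 - p)) ** (1 / shape)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)


def estimated_var(p, n, g, true_shape, true_scale, rng):
    """Estimate V(p, n, G) from simulated data.

    Assumption: G is modelled as a discount on n, so only n * (1 - G) draws
    are used. This is a stand-in for representativeness, not a model of it.
    """
    n_eff = max(2, round(n * (1.0 - g)))
    # random.weibullvariate takes (scale, shape) in that order.
    samples = [rng.weibullvariate(true_scale, true_shape) for _ in range(n_eff)]
    shape_hat, scale_hat = weibull_mle(samples)
    return weibull_var(p, shape_hat, scale_hat)


rng = random.Random(0)
v1 = estimated_var(0.99, n=500, g=0.6, true_shape=1.5, true_scale=2.0, rng=rng)
v2 = estimated_var(0.99, n=5000, g=0.2, true_shape=1.5, true_scale=2.0, rng=rng)
print(v1, v2, (v2 - v1) / v1)  # relative difference between the two scenarios
```

A single pair of estimates reflects sampling noise as well as the change in $n$ and $G$, so in practice the comparison would be repeated over many simulated samples to see how the spread of the estimates narrows.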
Note that the choice of confidence level also affects the tail risk estimation. A higher confidence level focuses on more extreme events in the tail of the distribution, while a lower confidence level considers a broader range of tail events. The appropriate choice depends on the specific risk management objectives and the nature of the underlying risk being assessed.