Current Research
This page provides an overview of the chair's current research. Working papers are available on request.
- On the use of satellite information to estimate agricultural carbon footprint in a small area framework
Pajno, R.; Carillo, F.; Maranzano, P.; Schmid, T.; Borgoni, R.
The agricultural sector is undergoing rapid change due to climate pressures, demographic shifts, and uneven economic development, increasing the demand for reliable environmental indicators at fine spatial scales. However, limited data availability often constrains subregional analyses. This study develops a model-based framework for producing reliable small-area estimates for assessing the agricultural carbon footprint in the Po Valley (Northern Italy), a region characterized by intensive livestock farming and high environmental pressure. We integrate survey, census, and satellite-derived emission data into a unified framework and produce estimates at the level of Agrarian Subregions, defined as agriculturally homogeneous municipalities by the Italian National Institute of Statistics. Satellite-based ammonia emission data are incorporated as auxiliary covariates to improve precision and spatial coherence. A key methodological contribution is the treatment of spatial misalignment between gridded satellite data and administrative boundaries. This issue is addressed through a geostatistical upscaling procedure combined with a parametric bootstrap that propagates uncertainty from the covariate construction stage to the final small-area estimates. The results show that satellite-derived information substantially improves the accuracy and stability of carbon footprint estimates while reducing reliance on large, heterogeneous auxiliary datasets, illustrating the potential of Earth observation data in model-based environmental statistics.
- Regionalised Estimation of Population Figures using Machine Learning Algorithms
Krabbel, F.; Neef, S.; Schmid, T.
Small-area population data are essential for political planning and the monitoring of sustainability goals, yet in Germany they are only available at the municipal level between census years. This paper evaluates the disaggregation method proposed by Stevens et al. (2015), in which a Random Forest model captures the relationship between spatial covariates and population density at the municipal level, subsequently disaggregating the results to INSPIRE-compliant grid sizes of 100 m × 100 m and 1 km × 1 km using a top-down approach. Based on Bavarian census data from 2022, the reference model is compared against several extensions, including a Box-Cox transformation of the target variable, stricter variable selection, and Gradient Boosting as well as a linear model as alternative weighting algorithms. Prediction accuracy is evaluated against true census counts at the grid level, allowing for a realistic assessment of the method's performance at fine spatial resolution.
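The top-down step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `disaggregate_top_down` and the toy numbers are hypothetical, and the cell weights stand in for the predictions of whatever weighting model is used (Random Forest, Gradient Boosting, or a linear model).

```python
import numpy as np

def disaggregate_top_down(totals, cell_muni, cell_weights):
    """Split each municipal total across its grid cells in proportion to the cell weights.

    totals       : population count per municipality (index = municipality id)
    cell_muni    : municipality id of every grid cell
    cell_weights : non-negative weight per grid cell (assumed model output)
    """
    totals = np.asarray(totals, dtype=float)
    cell_muni = np.asarray(cell_muni)
    cell_weights = np.asarray(cell_weights, dtype=float)

    # Sum of the cell weights within each municipality.
    weight_sums = np.bincount(cell_muni, weights=cell_weights, minlength=len(totals))
    if np.any(weight_sums[cell_muni] <= 0):
        raise ValueError("every municipality needs a positive total weight")

    # Each cell receives its share of the municipal total.
    shares = cell_weights / weight_sums[cell_muni]
    return totals[cell_muni] * shares

# Toy example: two municipalities with three and two grid cells (made-up numbers).
totals = [1000.0, 200.0]
cell_muni = [0, 0, 0, 1, 1]
cell_weights = [5.0, 3.0, 2.0, 1.0, 1.0]
cells = disaggregate_top_down(totals, cell_muni, cell_weights)
```

By construction the cell counts add up to the known municipal totals, so only the allocation within a municipality depends on the weighting model.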
- Selection Bias in Web Surveys: Evaluating Classical and Machine Learning Correction Methods Across European Countries
Prücklmair, F.; Rendtel, U.
Web surveys suffer from systematic selection bias when participants self-select based on characteristics correlated with the outcome of interest. This study evaluates five estimators for correcting such bias in the context of self-assessed health across 30 European countries. Using the most recent wave of the European Social Survey (ESS, Round 11, 2023/2024) as the data basis, nonprobability samples are simulated by selecting individuals based on above-median daily internet usage time, thereby accounting for recent technological developments that have rendered mere internet access an increasingly non-discriminating indicator of selection. The estimators under comparison cover classical and more recent approaches: propensity score adjustment (PSA) and doubly robust estimation (DR) based on logistic and linear regression, a Random Forest-based propensity score estimator (RF-PSA), an XGBoost-based doubly robust estimator (XDR), and a CatBoost-based doubly robust estimator (CatDR), representing a novel application of CatBoost within the nonprobability sample correction framework. Estimator performance is evaluated against the backdrop of the ignorability assumption, which is tested separately for each country and shown to hold in the majority of cases. Where this assumption is satisfied, bias can be substantially reduced across countries, as measured by relative bias, RRMSE, and confidence interval coverage. CatDR achieves particularly strong results across these metrics, suggesting that gradient boosting with native categorical feature handling offers a promising extension to existing correction approaches.
- A geo-statistical small area estimation framework for crop yields under reduced sample designs
Kyalo, R.K.
Reliable crop yield statistics are critical for agricultural planning and food-security monitoring. In Kenya, however, large-scale agricultural surveys, such as the KIHBS 2015/2016, are costly and infrequently implemented, limiting the timeliness of official yield estimates. This study proposes an enhanced small area estimation (SAE) framework that integrates agro-ecological zone (AEZ) indicators and geospatial covariates into a Fay-Herriot (FH) model to generate county-level estimates for major food crops. By incorporating AEZ effects, the model achieved significantly improved fit and predictive capacity, as validated through information criteria and likelihood-ratio testing. To address the challenge of high survey costs, a systematic sensitivity analysis was conducted by subsampling the original survey to 80%, 50%, 30%, and 10%. The results demonstrate that at a 30% sample size, the FH estimates maintain levels of precision comparable to the full-sample direct estimates, which typically require 80% or more of the sample to achieve similar reliability. These findings suggest that substantial efficiency gains are possible by combining reduced survey designs with model-based estimation. This study provides empirical evidence that Kenya’s National Bureau of Statistics could reduce survey costs and increase the frequency of monitoring by adopting AEZ-explicit SAE methods without compromising statistical integrity.
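The Fay-Herriot predictor that this abstract builds on combines each direct estimate with a regression-synthetic estimate. The sketch below illustrates that shrinkage under simplifying assumptions: the random-effect variance `sigma2_u` is treated as known (in practice it is estimated, for example by REML), and the function name and toy data are hypothetical rather than taken from the paper.

```python
import numpy as np

def fay_herriot_predict(y, X, D, sigma2_u):
    """Fay-Herriot predictor with a known random-effect variance (illustrative only).

    y        : direct estimates per area
    X        : area-level covariates (including an intercept column)
    D        : sampling variances of the direct estimates
    sigma2_u : variance of the area random effects (assumed known here)
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    D = np.asarray(D, dtype=float)

    # Weighted least squares for the regression coefficients,
    # with weights 1 / (sigma2_u + D_i).
    w = 1.0 / (sigma2_u + D)
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ y)

    # Shrinkage factor: close to 1 when the direct estimate is precise.
    gamma = sigma2_u / (sigma2_u + D)
    synthetic = X @ beta
    theta = gamma * y + (1.0 - gamma) * synthetic
    return theta, gamma, synthetic

# Toy data for four areas (made-up numbers).
y = np.array([10.0, 12.0, 9.0, 15.0])
X = np.column_stack([np.ones(4), [1.0, 2.0, 0.5, 3.0]])
D = np.array([1.0, 4.0, 0.25, 9.0])
theta, gamma, synthetic = fay_herriot_predict(y, X, D, sigma2_u=1.0)
```

Areas with a large sampling variance receive a small `gamma` and are pulled towards the synthetic estimate, which is why model-based estimates can remain stable when the sample is reduced.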
- Small area estimation based on twofold models with the R package EMDI
Kyalo, R.K.; Lee, Y.; Schmid, T.; Würz, N.
The onefold empirical best predictor (EBP) approach proposed by Molina and Rao (2010), along with the empirical best linear unbiased predictor (EBLUP) approach introduced by Fay and Herriot (1979), is implemented in the R package emdi. We enhance the functionality of the emdi package by integrating twofold EBP and twofold EBLUP models. Building upon the theoretical foundations laid by Marhuenda et al. (2017) and Torabi and Rao (2014), we implement recent advancements in twofold small area models, introducing features such as data-driven transformations, parametric bootstrap MSE estimation, and the estimation of ratio-type indicators. A key highlight of this implementation is the ability to display and visualize estimates at both levels of aggregation, empowering researchers and practitioners to address the hierarchical structure of data more effectively. By overcoming the limitations of onefold models, these capabilities facilitate more precise small area estimates and support decision-making at disaggregated levels. Additionally, the package supports customization of estimation options and leverages parallelization for efficient computation. These functionalities are implemented to align with the existing framework for onefold models in emdi, ensuring ease of use for practitioners.
- Flexible Bayesian Mixed-Effects Models for Small Area Estimation
Voll, L.; Goes, J.; Schmid, T.
Classical unit-level mixed models remain a standard approach in small area estimation. However, their assumptions (normality and linearity) often prevent them from capturing skewness and complex relationships commonly present in socio-economic data. This may lead to biased point estimates and unreliable measures of uncertainty. Moreover, standard frequentist approaches typically rely on global parametric transformations to satisfy these assumptions, which may be inadequate when distributional characteristics and functional relationships vary across covariates or hierarchical levels. To address these limitations, we propose a semi-parametric Bayesian mixed-effects framework in which the conditional mean is modelled using flexible non-parametric machine learning methods, while retaining a parametric hierarchical structure for random effects. This hybrid approach allows for the modelling of complex non-linearities and implicitly accommodates non-Gaussian features of the response without relying on restrictive transformation assumptions. In addition, the framework yields full posterior predictive distributions at the area level, enabling coherent probabilistic inference for the indicators of interest. The proposed method is evaluated in a comprehensive Monte Carlo simulation study and compared with established approaches, including the Empirical Best Predictor (EBP) and Mixed-Effects Random Forest (MERF). Performance is assessed with respect to point estimation accuracy, while distributional accuracy is further evaluated using proper scoring rules.
- Twofold nested error regression models with data-driven transformation
Kyalo, R.K.; Schmid, T.; Würz, N.
Small area estimation effectively addresses the issue of small sample sizes within sub-populations. Typically, the target population is divided into multiple nested hierarchical levels, such as counties and sub-counties. A twofold nested error regression model with area and sub-area random effects captures the variability across these levels. For estimating non-linear indicators like poverty measures, the twofold EBP model can be used. The model relies on normality assumptions for the error terms, a condition often unmet in real data applications. This research enhances the twofold nested error regression model by incorporating data-driven transformations, improving the model's robustness and flexibility. MSE estimation is performed using resampling methods. Model-based simulations compare the proposed model's performance with onefold EBP methods that include either area or sub-area random effects. Results show that the proposed twofold EBP method with data-driven transformation adapts to the distribution shape, thereby providing more efficient estimates than a fixed logarithmic transformation or no transformation. Finally, the twofold EBP with data-driven transformation is used to generate poverty estimates for rural and urban regions within Kenyan counties, offering a more nuanced and accurate assessment of poverty levels.
- Small Area Estimation under limited auxiliary data and complex survey data
Neef, S.; Schmid, T.; Würz, N.
Abstract: The paper proposes an Empirical Best Linear Unbiased Predictor that allows for fixed and data-driven transformations under limited auxiliary data while simultaneously adjusting for complex survey designs. Fixed or data-driven transformations are a common method for reducing the skewness of a variable. However, when the area mean is calculated and only limited auxiliary data are available, first- and second-order biases are introduced due to Jensen’s inequality (Würz et al., 2022). Additional bias is introduced by disregarding a complex survey design. By incorporating the design weights into the estimation and utilizing kernel density estimation (KDE), we hope to reduce this bias. We furthermore propose a weighted bootstrap estimator for precise quantification of the variance. The method is applied to data from the Socio-Economic Panel and evaluated through a model-based simulation study. We want to show that, compared to already established methods like the (weighted) EBP, the proposed estimator produces similar results while requiring less information.
- Area-level small area estimation with random forests
Harmening, S.; Lee, Y.; Runge, M.; Schmid, T.
Abstract: An approach that combines a small area estimation model with tree-based methods to provide a solution when only area-level data are available is presented, namely the area-level mixed-effects random forest. In particular, the linear regression synthetic part of the Fay-Herriot model is replaced by a random forest to link survey data with related administrative information or data from other sources. By using a random forest, possible interactions and nonlinear relationships are accounted for, and automatic variable selection and robustness to outliers are indirectly provided as a property of the random forest. To obtain point estimates for an indicator of interest, the familiar structure of the Fay-Herriot estimator is retained. The estimation is done by implementing an expectation maximization algorithm. To determine the uncertainty of the point estimator, a nonparametric bootstrap method for estimating the mean squared error is presented. The use of data transformations like the log transformation is investigated in the context of machine learning methods. In particular, a log transformation is applied to the direct estimates and due to the nonlinearity of the logarithm, the final point mixed-effects random forest and mean squared error estimates on the original scale are back-transformed by taking into account a bias-correction. To evaluate the accuracy and precision of the proposed estimator and its uncertainty measure, model-based simulations are carried out. The presented methodology is illustrated by using household survey and remote sensing data from Mozambique to estimate average per capita consumption at a km grid-level.
- Small area estimation with generalized random forests: Estimating poverty rates in Mexico
Frink, N.; Schmid, T.
Abstract: Identifying and addressing poverty is challenging in administrative units with limited information on income distribution and well-being. To overcome this obstacle, small area estimation methods have been developed to provide reliable and efficient estimators at disaggregated levels, enabling informed decision-making by policymakers despite the data scarcity. From a theoretical perspective, we propose a robust and flexible approach for estimating poverty indicators based on binary response variables within the small area estimation context: the generalized mixed effects random forest. Our method employs machine learning techniques to identify predictive, non-linear relationships from data, while also modeling hierarchical structures. Mean squared error estimation is explored using a parametric bootstrap. From an applied perspective, we examine the impact of information loss due to converting continuous variables into binary variables on the performance of small area estimation methods. We evaluate the proposed point and uncertainty estimates in both model- and design-based simulations. Finally, we apply our method to a case study revealing spatial patterns of poverty in the Mexican state of Tlaxcala.
- Gradient Boosting for Hierarchical Data in Small Area Estimation
Messer, P.; Schmid, T.
Abstract: This paper introduces Mixed Effect Gradient Boosting (MEGB), which combines the strengths of Gradient Boosting with Mixed Effects models to address complex, hierarchical data structures often encountered in statistical analysis. The methodological foundations, including a review of the Mixed Effects model and the Extreme Gradient Boosting method, are presented in detail and lead to the introduction of MEGB. The paper shows how MEGB can derive area-level mean estimates from unit-level data and calculate Mean Squared Error (MSE) estimates using a nonparametric bootstrap approach. MEGB's performance is evaluated through model-based and design-based simulation studies and compared against established estimators. The findings indicate that MEGB provides promising area mean estimates and may outperform existing small area estimators in various scenarios. The paper concludes with a discussion of future research directions, highlighting the possibility of extending MEGB's framework to accommodate different types of outcome variables or non-linear area-level indicators.
- Estimation of the consumer price index with regional weights using small area estimation methods: A case study of Germany
Lee, Y.
Abstract: The consumer price index (CPI) is an important indicator for formulating effective economic policies. Most countries produce a national CPI, while some also publish a sub-national (regional) CPI. In the latter case, states sometimes use national product weights, which do not adequately represent the importance of products at the regional level. An ideal regional CPI uses regional product weights to accurately reflect regional specifics. In this study, I explore the estimation of a regional CPI with regional weights by using an income and consumption survey from Germany. To obtain reliable regional weights, I focus on the estimation of regional expenditures for each product. Estimating regional expenditures is challenging because of small sample sizes, which potentially produce unreliable estimates. I address this problem by using small area estimation models, and I show how a model-based estimation of regional expenditures substantially improves the reliability of estimation in Germany.
- Estimating Disaggregated Mobility Indicators using Area-Level Models on Survey Data
Mühlbauer, M.
Abstract: Statistical indicators are traditionally estimated from survey data using direct estimation methods. Large-scale mobility surveys, such as Mobilität in Deutschland 2017 (MiD) (Mobility in Germany 2017), are often designed to produce reliable statistical indicators at the level of specific subpopulations defined by geographical regions (e.g., states, districts, counties) or other relevant criteria, such as demographic characteristics (e.g., age, gender). After the data have been collected, there is often a need for indicators at a lower geographical level, for which direct estimation may not provide a satisfactory degree of precision due to insufficient sample sizes. Area-level Small Area Estimation (SAE) models exploit correlations between the dependent variable and auxiliary data to provide sufficiently precise estimates at the desired disaggregated level of interest. This paper demonstrates the application of SAE in the context of mobility research using a (transformed) area-level Fay-Herriot model to estimate district-level transport activity measured in Mean Trip Kilometers (MTK) using MiD data. It answers the question of how reliable disaggregated estimates of MTK and potentially other metric mobility indicators can be estimated using this methodology. Covariates are obtained by aggregating data from MiD and the extensive Infas 360 CASA dataset. In a second step, they are selected using a multi-stage procedure incorporating the LASSO regularization technique. The variances of the direct estimates play an important role in the Fay-Herriot model and are estimated using a bootstrap which is calibrated on a set of MiD totals. Two distributional assumptions, the Gaussianity of the random effects and the residuals, are made in the modeling process. As the data do not fulfill these assumptions, a logarithmic transformation is applied, which provides a better fit to normality. The results show a significant spread of the mean trip length over the German districts. The densely populated districts of North Rhine-Westphalia have the shortest average trips, while districts in the northeastern states of Germany are characterized by significantly longer trips.
- Releasing Survey Microdata with Exact Cluster Locations and Additional Privacy Safeguards
Koebe, T.; Arias-Salazar, A.; Schmid, T.
Abstract: Household survey programs around the world publish fine-granular georeferenced microdata to support research on the interdependence of human livelihoods and their surrounding environment. To safeguard the respondents’ privacy, micro-level survey data is usually (pseudo)-anonymised through deletion or perturbation procedures such as obfuscating the true location of data collection. This, however, poses a challenge to emerging approaches that augment survey data with auxiliary information on a local level. Here, we propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards through synthetically generated data using generative models. We back our proposal with experiments using data from the 2011 Costa Rican census and satellite-derived auxiliary information. Our strategy reduces the respondents’ re-identification risk for any number of disclosed attributes by 60-80% even under re-identification attempts.
- A framework for producing small area estimates based on area-level models in R
Harmening, S.; Kreutzmann, A.-K.; Pannier, S.; Salvati, N.; Schmid, T.
Abstract: The R package emdi facilitates the estimation of regionally disaggregated indicators using small area estimation methods and provides tools for model building, diagnostics, presenting, and exporting the results. The package version 1.1.7 includes unit-level small area models that rely on access to micro data, which may be challenging due to confidentiality constraints. In contrast, area-level models are less demanding with respect to (a) data requirements, as only aggregates are needed for estimating regional indicators, and (b) computational resources, and enable the incorporation of design-based properties. Therefore, the area-level model (Fay and Herriot 1979) and various extensions have been added to version 2.0.2 of the package emdi. These extensions include, amongst others, (a) transformed area-level models with back-transformations, (b) spatial and robust extensions, (c) adjusted variance estimation methods, and (d) area-level models that account for measurement errors. Corresponding mean squared error estimators are implemented for assessing the uncertainty. User-friendly tools such as a stepwise variable selection function, model diagnostics, benchmarking options, high-quality maps, and export options for the results give the user a complete analysis workflow, from model building to diagnostics. The functionality of the package is demonstrated by illustrative examples based on synthetic data for Austrian districts.
- Scale estimation and data-driven tuning constant selection for M-quantile regression
Dwaber, J.; Salvati, N.; Schmid, T.; Tzavidis, N.
Abstract: M-quantile regression is a general form of quantile-like regression which usually utilises the Huber influence function and corresponding tuning constant. Estimation requires a nuisance scale parameter to ensure the M-quantile estimates are scale invariant, with several scale estimators having previously been proposed. In this paper we assess these scale estimators and evaluate their suitability, as well as proposing a new scale estimator based on the method of moments. Further, we present two approaches for data-driven tuning constant selection for M-quantile regression. The tuning constants are obtained by i) minimising the estimated asymptotic variance of the regression parameters and ii) utilising an inverse M-quantile function to reduce the effect of outlying observations. We investigate whether data-driven tuning constants, as opposed to the usual fixed constant, for instance, at c=1.345, can improve the efficiency of the estimators of M-quantile regression parameters. The performance of the data-driven tuning constant is investigated in different scenarios using model-based simulations. Finally, we illustrate the proposed methods using a European Union Statistics on Income and Living Conditions data set.
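To make the role of the tuning constant concrete, the sketch below shows the Huber influence function and its asymmetric M-quantile version in their standard textbook form. It is an illustration only: the function names are hypothetical, the input `u` is assumed to be a residual already divided by a scale estimate, and none of the paper's scale estimators or data-driven selection rules are implemented.

```python
import numpy as np

def huber_psi(u, c=1.345):
    """Huber influence function: linear in the centre, clipped at +/- c."""
    return np.clip(np.asarray(u, dtype=float), -c, c)

def mquantile_psi(u, q=0.5, c=1.345):
    """Asymmetric Huber influence function used for the M-quantile of order q.

    Positive scaled residuals are weighted by q and non-positive ones by 1 - q;
    the factor 2 makes q = 0.5 coincide with the ordinary Huber function.
    """
    u = np.asarray(u, dtype=float)
    weight = np.where(u > 0, q, 1.0 - q)
    return 2.0 * weight * huber_psi(u, c)

# Scaled residuals, including two that exceed the tuning constant.
u = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
psi_median = mquantile_psi(u, q=0.5)
psi_upper = mquantile_psi(u, q=0.75)
```

A smaller `c` clips more residuals and increases robustness at the cost of efficiency under normal errors, which is the trade-off that a data-driven choice of the constant aims to balance.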
- Asymptotic distribution of regression quantiles in a mixed effects model
Hensel, S.; Pannier, S.; Schmid, T.; Tzavidis, N.
Abstract: Linear quantile models allow for a robust analysis of the conditional distribution of the variable of interest. The introduction of a random effects term extended their range of application to data with complex dependency structures, as they occur in many studies. This paper contributes to a deeper theoretical understanding of linear quantile mixed models by analysing the asymptotic behaviour of the corresponding maximum likelihood estimator. We prove that the estimator is consistent and show that it is asymptotically normally distributed. Additionally, a plug-in variance estimator is derived, and its finite sample behaviour is demonstrated in a simulation study.
