<p><strong>Introduction</strong></p><p>Flood early warning systems (FEWS) can reduce casualties and economic losses (UNEP, 2012). The EC Horizon 2020 project FANFAR (www.fanfar.eu) aims to co-develop a FEWS for West Africa together with stakeholders, predicting streamflow and the exceedance of return period thresholds (Andersson et al., 2020). A Multi-Criteria Decision Analysis (MCDA) indicated that, among a broad set of fundamental objectives, stakeholders consider information accuracy especially important (Lienert et al., 2020). Social media have the potential to support accuracy assessment by detecting flood events (Lorini et al., 2019; de Bruijn et al., 2019) due to their large spatial coverage (Restrepo-Estrada et al., 2018). We investigated the potential of social media to assess FANFAR forecast accuracy.</p><p>&#160;</p><p><strong>Research Approach</strong></p><p>FANFAR forecasts are based on HYPE, a semi-distributed, land-cover- and sub-catchment-based hydrological model (Arheimer et al., 2020). We lumped the forecasted flood risk (FFR) to the country scale and compared it with flood events detected on Twitter by FEDA, an algorithm developed by de Bruijn et al. (2019). FEDA detects bursts of flood-related tweets based on regionally and temporally adjusted thresholds (de Bruijn et al., 2019). We compared FEDA-detected events with floods recorded in the disaster database EM-DAT (https://www.emdat.be/) to test whether tweets indicate flooding. We also compared FEDA-detected events with the lumped FFR to identify false positives (FP), false negatives (FN), and true positives (TP), from which we deduced the probability of detection (POD) and false alarm rate (FAR). We further calculated the correlation of single flood-related tweets with the lumped FFR and investigated seasonality, lag, and the influence of rainfall.</p><p>&#160;</p><p><strong>Findings</strong></p><p>The detailed findings are described in Hohmann (2021). 
FEDA events (i.e., tweets) and EM-DAT events (i.e., floods) mostly occurred in the same periods. However, FEDA detected shorter and more frequent events than EM-DAT. In the Upper Niger, POD<sub>FEDA</sub> and FAR<sub>FEDA</sub> (deduced from FEDA) were of a similar order of magnitude to POD<sub>S</sub> and FAR<sub>S</sub> (deduced from streamflow), but they differed in the Lower Niger region. This suggests that tweets can complement other data, e.g. streamflow time series, in evaluating forecast accuracy. Correlation analysis between single flood-related tweets and the lumped FFR showed no relationship. We also found neither a systematic influence of seasonality nor a lagged response between tweets and FFR. The correlation coefficients between tweets and rainfall ranged from 0.1 to 0.9 but were mostly non-significant. This suggests that a performance assessment based on single tweets is not (yet) adequate. Moreover, since FEDA does not differentiate between pluvial and fluvial floods, it is less suited to assessing the accuracy of FANFAR. Our findings suggest that other factors need to be included in the performance assessment of FEWSs, such as regional thresholds to identify TP, FP, and FN. Rainfall causing pluvial flooding must also be considered. Finally, our approach is limited to Twitter. Further research should assess the potential of other platforms, e.g. Facebook, for inclusion in FEWS performance assessment. The question of whether social media, FEWSs, or EM-DAT are correct remains open and is, in our opinion, best addressed by employing multiple data sources.</p>
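<p>The abstract does not spell out how POD and FAR were derived from TP, FP, and FN. As an illustration only, the following minimal sketch assumes the common verification definitions POD = TP / (TP + FN) and FAR = FP / (TP + FP), i.e. the false alarm ratio, which needs no true negatives. The function names, the paired boolean series per time step, and the choice of the lumped FFR as forecast and FEDA events as reference are hypothetical and are not taken from Hohmann (2021).</p>

```python
import math


def contingency_counts(forecast, reference):
    """Count TP, FP, FN from two equally long boolean sequences.

    forecast:  True where the lumped FFR signals a flood (assumed forecast).
    reference: True where a flood event was detected, e.g. by FEDA
               (assumed reference data).
    """
    pairs = list(zip(forecast, reference))
    tp = sum(1 for f, r in pairs if f and r)        # flood forecast and detected
    fp = sum(1 for f, r in pairs if f and not r)    # flood forecast, none detected
    fn = sum(1 for f, r in pairs if not f and r)    # flood detected, not forecast
    return tp, fp, fn


def pod_far(tp, fp, fn):
    """Return (POD, FAR) under the assumed definitions.

    POD = TP / (TP + FN); FAR = FP / (TP + FP).
    NaN is returned where a denominator is zero.
    """
    pod = tp / (tp + fn) if (tp + fn) > 0 else math.nan
    far = fp / (tp + fp) if (tp + fp) > 0 else math.nan
    return pod, far


# Hypothetical five-step example, not data from the study.
forecast = [True, True, False, False, True]
reference = [True, False, True, False, True]
tp, fp, fn = contingency_counts(forecast, reference)
pod, far = pod_far(tp, fp, fn)
```

<p>In practice, forecast and detected events would first have to be matched in space and time, for instance with regional thresholds or a tolerance window, before such counts are meaningful.</p>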