The Pliocene Model Intercomparison Project Phase 2: large-scale climate features and climate sensitivity

The Pliocene epoch has great potential to improve our understanding of the long-term climatic and environmental consequences of an atmospheric CO2 concentration near ∼ 400 parts per million by volume. Here we present the large-scale features of Pliocene climate as simulated by a new ensemble of climate models of varying complexity and spatial resolution based on new reconstructions of boundary conditions (the Pliocene Model Intercomparison Project Phase 2; PlioMIP2). As a global annual average, modelled surface air temperatures increase by between 1.7 and 5.2 C relative to the pre-industrial era with a multi-model mean value of 3.2 C. Annual mean total precipitation rates increase by 7 % (range: 2 %–13 %). On average, surface air temperature (SAT) increases by 4.3 C over land and 2.8 C Published by Copernicus Publications on behalf of the European Geosciences Union. 2096 A. M. Haywood et al.: Pliocene climate and climate sensitivity over the oceans. There is a clear pattern of polar amplification with warming polewards of 60 N and 60 S exceeding the global mean warming by a factor of 2.3. In the Atlantic and Pacific oceans, meridional temperature gradients are reduced, while tropical zonal gradients remain largely unchanged. There is a statistically significant relationship between a model’s climate response associated with a doubling in CO2 (equilibrium climate sensitivity; ECS) and its simulated Pliocene surface temperature response. The mean ensemble Earth system response to a doubling of CO2 (including ice sheet feedbacks) is 67 % greater than ECS; this is larger than the increase of 47 % obtained from the PlioMIP1 ensemble. Proxy-derived estimates of Pliocene sea surface temperatures are used to assess model estimates of ECS and give an ECS range of 2.6–4.8 C. This result is in general accord with the ECS range presented by previous Intergovernmental Panel on Climate Change (IPCC) Assessment Reports.


From PlioMIP1 to PlioMIP2
The ability of the PlioMIP1 models to reproduce patterns of surface temperature change, reconstructed by marine as well as terrestrial proxies, was investigated via data/model comparison (DMC) in Dowsett et al. (2012; and Salzmann et al. (2013) respectively. Although the PlioMIP1 ensemble was able to reproduce many of the spatial characteristics of SST and 100 surface air temperature (SAT) warming, the models appeared unable to simulate the magnitude of warming reconstructed at the higher latitudes, in particular in the high North Atlantic (Dowsett et al. 2012(Dowsett et al. , 2013Haywood et al., 2013a;Salzmann et al. 2013). This problem has also been reported as an outcome of DMC studies for other time periods including the early Eocene (e.g. Lunt et al., 2012). Haywood et al. (2013aHaywood et al. ( , 2013b discussed the possible contributing factors to the noted discrepancies in DMC, noting three primary causal groupings: uncertainty in model boundary conditions, uncertainty in the interpretation of 105 proxy data and uncertainty in model physics (for example, recent studies have demonstrated that this model-proxy mismatch has been reduced by including explicit aerosol-cloud interactions in the newer generations of models (Sagoo and Storelvmo 2017;Feng et al., 2019)).
These findings substantially influenced the experimental design for the second Phase of PlioMIP (PlioMIP2). Specifically, PlioMIP2 was developed to (a) reduce uncertainty in model boundary conditions and (b) reduce uncertainty in proxy data 110 reconstruction. To accomplish (a), state-of-the-art approaches were adopted to generate an entirely new palaeogeography (compared to PlioMIP1), including accounting for glacial isostatic adjustments and changes in dynamic topography. This led to specific changes, compared to the PlioMIP1 palaeogeography, capable of influencing climate model simulations (Dowsett et al. 2016, Otto-Bliesner et al. 2017. These include the Bering Strait and Canadian Archipelago becoming sub-aerial and modification of the land/sea mask in the Indonesian/Australian region for the emergence of the Sunda and Sahul Shelves. To 115 achieve (b) it was necessary to move away from time-averaged global SST reconstructions, towards the examination of a narrow time slice during the late Pliocene that had almost identical astronomical parameters to the present-day. This made the orbital parameters specified in model experimental design, consistent with the way in which orbital parameters would have influenced the pattern of surface climate and ice sheet configuration preserved in the geological record. Using the astronomical solution of Laskar et al. (2004), Haywood et al. (2013b) identified a suitable interglacial event during the late Pliocene (Marine 120 Isotope Stage KM5c, 3.205Ma). The new PRISM4 (Pliocene Research, Interpretation and Synoptic Mapping version 4) global community-sourced data set of SSTs  targets the same interval in order to produce point-based SST data.
Here we briefly present the PlioMIP2 experimental design, details of the climate models included in the ensemble, as well as the boundary conditions used. Following this, we present the large-scale climate features of the PlioMIP2 ensemble focussed 125 solely on an examination of the control MP simulation designated as a CMIP6 simulation (called midPliocene-eoi400) and its differences to simulated conditions for the pre-industrial era (PI). We also present key differences between PlioMIP2 and PlioMIP1. PlioMIP2 sensitivity experiments will be presented in subsequent studies. We conclude by presenting the outcomes from a DMC using the PlioMIP2 model ensemble and a newly constructed PRISM4 global compilation of SSTs (Foley and 5 Dowsett, 2019), and assess the significance of the PlioMIP2 ensemble in understanding Equilibrium Climate Sensitivity (ECS) 130 and Earth System Sensitivity (ESS).

Boundary Conditions
All model groups participating in PlioMIP2 were required to use standardised boundary condition data sets for the core 135 midPliocene-eoi400 experiment (for wider accessibility this experiment will hereafter be referred to as PlioCore). These were derived from the U.S. Geological Survey PRISM data set, specifically the latest iteration of the reconstruction known as PRISM4 (Dowsett et al. 2016). They include spatially complete gridded data sets at 1° × 1° of latitude/longitude resolution for the distribution of land versus sea, topography, bathymetry, as well as vegetation, soils, lakes and land ice cover. Two versions of the PRISM4 boundary conditions were produced known as enhanced and standard. The enhanced version 140 comprises all PRISM4 boundary conditions including all reconstructed changes to the land/sea mask and ocean bathymetry.
However, groups which are unable to change their land/sea mask can use the standard version of the PRISM4 boundary conditions, which provides the best possible realisation of Pliocene conditions based around a modern land/sea mask. In practice all models except MRI-CGCM2.3 were able to utilise the enhanced boundary conditions.. For full details of the PRISM4 reconstruction and methods associated with its development, the reader is referred to Dowsett et al. (2016: this 145 volume).

Experimental Design
The experimental design for PlioCore and associated PI control experiments (hereafter referred to as PICtrl) was presented in Haywood et al. (2016;this volume), and the reader is referred to this paper for full details of the experimental design. In brief, 150 participating model groups had a choice of which version of the PRISM4 boundary conditions to implement (standard or enhanced). This approach was taken in recognition of the technical complexity associated with the modification of the land/sea mask and ocean bathymetry in some of the very latest climate and earth system models. A choice was also included regarding the treatment of vegetation. Model groups could either prescribe vegetation cover from the PRISM4 dataset (vegetation sourced from Salzmann et al., 2008), or simulate the vegetation using a dynamic global vegetation model. If the latter was 155 chosen, all models were required to be initialised with pre-industrial vegetation and spun-up until an equilibrium vegetation distribution is reached. The concentration of atmospheric CO2 for experiment PlioCore was set at 400 parts per million by volume (ppmv), a value almost identical to that chosen for the PlioMIP1 experimental design (405 ppmv), and in line with the very latest high-resolution proxy reconstruction of atmospheric CO2 of ~400 ppmv for ~3.2 million years ago using Boron isotopes (De La Vega et al. 2018). However, we acknowledge that there are uncertainties on the KM5c CO2 value, hence the 160 6 specification of Tier 1 PlioMIP2 experiments this volume), which have CO2 of ~350 ppmv and 450ppmv will be used to investigate CO2 uncertainty at a later date. All other trace gases, orbital parameters and the solar constant were specified to be consistent with each model's PICtrl experiment. The Greenland ice sheet (GIS) was confined to high elevations in the Eastern Greenland Mountains, covering an area approximately 25% of the present-day GIS. The PlioMIP2 Antarctic ice sheet configuration is the same as PlioMIP1 and has no ice over Western Antarctica. The reconstructed PRISM4 ice sheets 165 have a total volume of 20.1 × 10 6 km 3 , equating to a sea-level increase relative to present day of less than ∼24 m (Dowsett et al. 2016;this volume). Integration length was set to be 'as long as possible', or a minimum of 500 simulated years, however all but two of the modelling groups in PlioMIP2 contributed simulations that were in excess of 1000 years (supplementary   table 1). All modelling groups were requested to fully detail their implementation of PRISM4 boundary conditions, along with the initialisation and spin-up of their experiments in separate dedicated papers that also present some of the key science 170 results from each model, or family of models (see the separate papers within this special volume: https://www.climpast.net/special_issue642.html). NetCDF versions of all boundary conditions used for the PlioCore experiment, along with guidance notes for modelling groups, can be found here: https://geology.er.usgs.gov/egpsc/prism/7.2_pliomip2_data.html.

Participating Models 175
There are currently 16 climate models that have completed the PlioCore experiment to comprise the PlioMIP2 ensemble. These models were developed at different times and have differing levels of complexity and spatial resolution. A further model HadGEM3 is currently running the PlioCore experiment and results from this model will be compared with the rest of the PlioMIP2 ensemble in a subsequent paper. The current 16 model ensemble is double the size of the coupled atmosphere-ocean ensemble presented in the PlioMIP1 large-scale features publication (Haywood et al. 2013a). Summary details of the included 180 models, and model physics, along with information regarding the implementation of PRISM4 boundary conditions and each model's ECS can be found in Table 1 and Supplementary Table 1. Each modelling group uploaded the final 100 years of each simulation for analysis. These were then regridded onto a regular 1° × 1° grid using a bilinear interpolation, to enable each model to be analysed in the same way. Means and standard deviations for each model were then calculated across the final 50 years. 185

Equilibrium Climate Sensitivity (ECS) and Earth System Sensitivity (ESS)
In Section 3.6 we use the PlioCore and PICtrl simulations to investigate ECS and ESS. The PlioCore experiments represent a 400 ppmv world that is in quasi-equilibrium with respect to both climate and ice-sheets and hence represents an 'Earth System' response to the 400ppmv CO2 forcing. The 'Earth System' response to a doubling of CO2 (ie 560 ppmv-280 ppmv; ESS) can 190 then be estimated as follows: There will be errors in the estimate of ESS from the above equation, for example the equation assumes that the sensitivity to 195 CO2 is linearwhich may not necessarily be the case. Further errors will occur because of changes between PlioCore and PICtrl which should not be included in estimates of ESS, such as: land-sea mask changes, topographic changes, changes in soil properties and lake changes. However, all these additional changes are likely minimal compared to the ice sheet and GHG changes and are expected to have only a negligible impact on the globally averaged temperature, and therefore the estimate of ESS. For example, Pound et al. (2014) found that the inclusion of Pliocene soils and lake distributions in a climate model had 200 an insignificant effect on global temperature (even though changes regionally could be important).
To assess the relationship across the ensemble between the reported ECS and the modelled ESS, we correlate reported ECS across the ensemble with the associated PlioCore -PICtrl temperature anomalies. We do this on a global, zonal mean, and local scale. A strong correlation at a particular location would suggest that MP proxy data at that location could be used to derive a proxy-data constrained estimate of ECS (similar to an "emergent constraint"), while a weak correlation would suggest that 205 proxy-data at that location could not be used in ECS estimates. while the lower panel shows the anomaly between them. In this, and all subsequent figures, the models are ordered by ECS (see Table 2) such that the model with the highest published ECS (i.e. CESM2; ECS=5.3) is shown on the left, while the model with lowest published ECS (i.e. NorESM1-F; ECS = 2.3) is on the right. Increases in PlioCore global annual mean SATs, compared to each of the contributing models PICtrl experiment, range from 1.7 to 5.2 °C ( Fig. 1a; Table 2), with an ensemble mean ΔT of 3.2°C. The multi-model median ΔT is 3.0°C, while the 10th and 90 th percentiles are 2. 1°C and 4.8°C, respectively. 215 Analogous results from individual models of the PlioMIP1 ensemble are shown by the grey horizontal lines on Fig. 1a, and have a mean warming of 2.7°C. Pliocene warming for individual PlioMIP1 models falls into two distinct anomaly bands that are 1.8 -2.2°C (CCSM4, GISS-E2-R, IPSLCM5A, MRI2.2) and 3.2 -3.6°C (COSMOS, HadCM3, MIROC4m, NorESM-L).

Surface air temperature (SAT)
PlioMIP2 shows a greater range of responses than PlioMIP1, and PlioMIP2 results are more evenly scattered over the ensemble range. The ensemble mean temperature anomaly is larger in PlioMIP2 than in PlioMIP1 because of the addition of new and 220 8 more sensitive models to PlioMIP2, rather than being due to the change in boundary conditions between PlioMIP1 and PlioMIP2.
PlioMIP2 shows increased SATs over the whole globe (Fig. 1b), with an ensemble average warming of ~2.0°C for the tropical oceans (20°N-20°S), which increases towards the high latitudes (Figs. 1b,c). Multi-model mean SAT warming can exceed 225 12°C in Baffin Bay and 7°C in the Greenland Sea (Fig. 1b), a result potentially influenced by the closure of the Canadian Archipelago and Bering Strait, as well as by the specified loss of most of the Greenland Ice Sheet (GIS), and the simulated reduction in Northern Hemisphere sea-ice cover (de Nooijer et al., 2020). In the Southern Hemisphere, warming is pronounced in regions of Antarctica that were deglaciated in the MP in both west and east Antarctica (Fig. 1b). Warming in the interior of east Antarctica is limited by the prescribed topography of the MP East Antarctic Ice Sheet (EAIS), which in some places 230 exceeds the topography of the EAIS in the models' PICtrl experiments.
In terms of magnitude, the CESM2 model has the greatest apparent sensitivity to imposing MP boundary conditions with a simulated ΔT of 5.2 °C (Fig 1a). This model was published in 2020 and has the highest ECS of all the PlioMIP2 models. This model was not included in PlioMIP1, and its response to Pliocene boundary conditions lies outside the range of all PlioMIP1 models both in global mean and for every latitude band (Figure 1a, 1c). It is also warmer than the PlioMIP2 multi-model mean 235 at nearly all gridboxes ( Supplementary Fig. 1). Other particularly sensitive models (EC-Earth3.3, CESM1.2, CCSM4-Utr and CCSM4-UoT; shown as an anomaly from the multi-model mean in Supplementary Fig. 1) are also new to PlioMIP2 and this explains why the simulated ΔT from PlioMIP2 exceeds that from PlioMIP1. The model with the lowest response to PlioMIP2 boundary conditions is the NorESM1-F model, which is also the model with the lowest published ECS. Although there is clearly some correlation between a model's ECS and its PlioCore -PICtl temperature anomaly, the relationship is not exact. In 240 particular, the versions of CCSM4 that were run by Utrecht University (CCSM4-Utr) and the University of Toronto (CCSM-UoT) both show a large Pliocene response but have a modest ECS compared to the other models.
Three different versions of CCSM4 contributed to PlioMIP2 (see Table 1): the standard version run at NCAR (hereafter referred to as CCSM) has a simulated ΔT = 2.6 °C, while CCSM4-Utr has a simulated ΔT = 4.7 °C and CCSM4-UoT has a simulated ΔT = 3.8 °C. A notable difference between these simulations is the response in the 60°S -90°S band where the 245 mean warming in the CCSM4-Utr simulation is 4 °C higher than in the CCSM4-UoT simulation and 6.6 °C higher than in the CCSM4 simulation ( Fig. 1c; Supplementary Fig. 1). Supplementary Table 1 shows that even though the CCSM4 models differ in their response they all appear to be close to equilibrium. In addition, they are all reported to have similar ECS (Table 1) and they all have the same physics apart from changes to the standard ocean model in the CCSM4-UoT simulations and the PlioCore CCSM4-Utr simulation. These changes (discussed by Chandan & Peltier, 2017, this volume) are: 1. the vertical profile 250 of background diapycnal mixing has been fixed to a hyperbolic tangent form, and 2. tidal mixing as well as dense water overflow parameterization schemes have been turned off. Although the exact cause of the differences in ΔT between the CCSM4 models remains unclear, the changes in the ocean parameterisations and differences in initialization may contribute to the ΔT differences, in particular the changes in ocean mixing between different versions of the model (Fedorov et al., 2010).
Analysis of the standard deviation of the model ensemble ( Fig. 1d) indicates that models are generally consistent in terms of 255 the magnitude of temperature response in the tropics, especially over the oceans. However, they can differ markedly in the higher latitudes, where the inter-model standard deviation reaches more than 4.5°C.
To evaluate whether the multi-model mean PlioCore -PICtrl anomaly at a gridbox is "robust" we follow the methodology of Mba et al. (2018) and Nikulin et al. (2018). The anomaly is said to be "robust" if two conditions are fulfilled: (1) at least 80% models agree on the sign of the anomaly, and (2) the signal to noise ratio (i.e. the ratio of the size of the mean anomaly to the 260 inter-model standard deviation [ Fig. 1b / Fig. 1d]) is greater than or equal to one. Regions where the SAT anomaly is considered robust according to these criteria are hatched in Fig. 2. It is seen that for SAT the PlioCore -PICtrl anomaly is considered robust across the ensemble over nearly all the globe.

Seasonal cycle of surface air temperature, land/sea temperature contrasts and polar amplification 265
The Northern Hemisphere (NH) averaged SAT anomaly over the seasonal cycle is presented in Fig. 3a. Overall, the ensemble mean anomaly (black dashed line) is fairly constant throughout the year, however, models within the ensemble have very different characteristics in terms of the monthly and seasonal distribution of the warming. Some members of the ensemble have a relatively flat seasonal cycle in ΔSAT (e.g. NorESM-L, NorESM1-F, COSMOS), however others show a very strong seasonal cycle. The models that show a very strong seasonal cycle do not agree on the timing of the peak warming. For 270 example, EC-Earth3.3 has the peak warming in October, CESM2 has peak warming in July and MRI2.3 has peak warming in Jan/Feb, The lack of consistency in the seasonal signal of warming has interesting implications in terms of whether PlioMIP2 outputs could be used to examine the potential for seasonal bias in proxy data sets. To do this meaningfully would require clear consistency in model seasonal responses, which is absent in the PlioMIP2 ensemble. The grey shaded area in Fig 3a shows the range of NH temperature response in PlioMIP1, with the PlioMIP1 ensemble average shown by the black dotted 275 line. Although the ensemble average from PlioMIP2 and PlioMIP1 both show a relatively flat seasonal cycle, the range of responses is very different between the two ensembles. PlioMIP1 predicted a large range of temperature responses in the NH winter, which reduced in the summer. In PlioMIP2, however, the summer range is amplified compared to the winter. Indeed 7 of the 16 PlioMIP2 models show a NH summer temperature anomaly that is noticeably above that seen in any of the PlioMIP1 simulations. Some of these models (CESM2, EC-Earth3.3, CCSM4-Utr, CCSM4-UoT and CESM1.2) did not contribute to 280 PlioMIP1, showing that which models are included in an ensemble can strongly affect the ensemble response. However other models (MIROC4m and HadCM3) that show an enhanced summer response in PlioMIP2 were also included in PlioMIP1, showing that there is also an impact of the change in boundary conditions on seasonal temperature. None of the PlioMIP2 models replicate the lowest warming seen in DJF in the PlioMIP1 ensemble, this lowest value was derived from the GISS-E2-R model in PlioMIP1 which did not contribute to PlioMIP2. 285 The ensemble results for land/sea temperature contrasts clearly indicate a greater warming over land than over the oceans (Fig.   3b). This result also holds when only the land/sea temperature contrast in the tropics is considered. The land amplification factor is similar in PlioMIP2 and PlioMIP1, and models in both ensembles cluster near a land amplification factor of ~1.5.
There is also no relationship between a model's climate sensitivity and the land amplification factor. The multi-model median (10 th percentile / 90 th percentile) warming over the land and ocean is 4.5°C (2.6°C / 6.1 °C) and 2.5°C (1.9°C / 4.4 °C) 290 respectively.
The extratropical NH (45°N-90°N) warms more than the extratropical Southern Hemisphere (SH) (45°S-90°S) in 5 of the 8 models (62%) from PlioMIP1 and in 11 of the 16 models (69%) from PlioMIP2 (Fig. 3c). This shows that neither the change in boundary conditions nor the addition of newer models to PlioMIP2 affects the ensemble proportion of enhanced NH warming. Neither does the published ECS have any obvious impact on whether the warming is concentrated in the NH or the 295 SH. The models that indicate greater SH versus NH warming (CCSM4-Utr, GISS2.1G, NorESM-L), are among those that have weaker differences between land and ocean warming (Fig 3b).
Polar amplification (PA) can be defined as the ratio of polar warming (poleward of 60° in each hemisphere) to global mean warming (Smith et al 2019). The PA for each model for the NH and the SH is shown in Fig. 3d. All models show PA > 1 for both hemispheres, although whether there is more PA in the NH or SH is a model dependent feature. The ensemble mean 300 (median) PA is 2.3 (2.2) in both the NH and the SH, suggesting that across the ensemble PA is hemispherically symmetrical.
This result is very similar to PlioMIP1 (not shown), which suggests that the enhanced warming in the PlioMIP2 ensemble does not affect the PA. For PlioMIP2, the NH median PA is 2.2, with the 10 th and 90 th percentiles at 1.9 and 2.8 respectively, while in the SH the median PA is 2.2, with the 10 th and 90 th percentiles at 1.8 and 3.1 respectively. Polar amplification is lower over the land than the ocean (Supplementary Figure 2) in both hemispheres. The NH mean (10 th / 50 th / 90 th percentiles) PA is 305 1.6 (1.4 / 1.6 / 1.9) and 2.7 (2.4 / 2.7 / 3.3) over the land and ocean respectively, while the SH mean (10 th / 50 th / 90 th percentiles) PA is 0.9 (0.5 / 0.8 / 1.5) and 1.9 (1.1 / 1.9 / 2.5) over the land and ocean respectively. Note that in the SH total PA is higher than both land and ocean PA because of the change in the area of land between the PlioCore and PICntl experiments. There appears to be a weak relationship between the PA factor and a model's ECS. Those models which have a lower published ECS (those to the right of Fig. 3d) have a tendency towards higher PA. This is not because these models have excess warming 310 at high latitudes, rather these models have less tropical warming than other models.

Meridional/zonal SST gradients in the Pacific and Atlantic
There has been great interest in the reconstruction of Pliocene SST gradients in the Atlantic and Pacific to provide first order assessments of Pliocene climate change, and to assess possible mechanisms of Pliocene temperature enhancement and 315 ocean/atmospheric dynamic responses (Rind and Chandler, 1991). For example, the meridional gradient in the Atlantic has been discussed in terms of the potential for enhanced Ocean Heat Transport in the Pliocene (e.g. Dowsett et al., 1992). In addition, the zonal SST gradient across the tropical Pacific has been used to examine the potential for change in Walker Circulation and, through this, ENSO dynamics and teleconnection patterns during the Pliocene (Fedorov et al., 2013;Burls and Fedorov, 2014;Tierney et al., 2019). 320 The multi-model mean meridional profile of zonal mean SSTs in the Atlantic Ocean is shown in Fig. 4a. In the tropics and sub-tropics, the SST increase between the PlioCore and PICtrl experiments is 1.5-2.5°C. This difference increases to ~5.0°C in the NH at ~55°N, with an inter-model range of 2°C -11°C. The Pliocene and Pre-industrial meridional SST profile in the Pacific (Fig. 4b) is similar to that of the Atlantic, but with little indication from the multi-model mean for a high latitude enhancement in meridional temperature. However, a large range in the ensemble response is noted, and the importance of an 325 adjustment of the vertical mixing parameterization towards simulation of a reduced Pliocene meridional gradient has been recently shown (Lohmann et al., in review).
In the tropical Atlantic (20°N -20°S) the multi-model mean zonal mean SST for the Pliocene increases by ~1.9°C (ensemble range from 0.8°C to 3-4°C), with a flat zonal temperature gradient across the tropical Atlantic (Fig. 4c). In the tropical Pacific both Pliocene and pre-industrial ensembles clearly show the signature of both a western Pacific Warm Pool, and the relatively 330 cool waters in the eastern Pacific that are associated with upwelling ( Fig. 4d). As such, a clear east-west temperature gradient is evident in the Pliocene tropical Pacific in the PlioMIP2 ensemble (similar to PlioMIP1) and is not consistent with a permanent El-Niño (see Supplementary Fig. 3). The PlioMIP2 ensemble supports a recent proxy-derived reconstruction for the Pacific that found Pliocene ocean temperatures increased in both the eastern and western Tropical Pacific (Tierney et al., 2019). 335 Using the methodology of Mba et al. (2018) and Nikulin et al. (2018), the signal of SST change seen in the multi-model mean is robust over nearly all ocean grid cells ( Supplementary Fig. 4). Supplementary Fig. 3 shows the difference between the Pliocene ΔSST for each model in the PlioMIP2 ensemble and the Pliocene ΔSST of the multi-model mean. This illustrates that despite the climate anomaly being larger than the inter-model standard deviation there are still many regions (e.g. Southern Ocean, North Atlantic Ocean, Arctic Ocean) where there is a notable inter-model spread of the magnitude of the Pliocene SST 340 anomalies.

Total precipitation rate
Simulated increases in PlioCore global annual mean precipitation rates, compared to each contributing model's PICtrl experiment, (hereafter referred to as ΔPrecip) ranges from 0.07 to 0.37 mm/day (Fig. 5a), which is notably larger than the 345 PlioMIP1 range of 0.09-0.18 mm/day (shown as horizontal grey lines on Fig. 5a). The PlioMIP2 ensemble mean ΔPrecip is 0.19 mm/day. The increase in the globally averaged precipitation anomaly in PlioMIP2 is due to the addition of new models to the ensemble, which have high ECS and are also more sensitive to the PlioMIP2 boundary conditions. Models that were included in PlioMIP1 (COSMOS, IPSLCM5A, MIROC4m, HadCM3, CCSM4, NorESM-L and MRI2.3) show PlioMIP2 precipitation anomalies that are similar to PlioMIP1 results. The spatial pattern (Fig. 5b) shows enhanced precipitation over 350 high latitudes and reduced precipitation over parts of the subtropics. The largest ΔPrecip is found in the tropics, in regions of the world that are dominated by the monsoons (West Africa, India, East Asia). The enhancement in precipitation over North Africa is consistent with previous Pliocene modelling results that have demonstrated a weakening in Hadley Circulation linked to reduced pole to equator temperature gradient (e.g. Corvec and Fletcher 2017). Greenland shows increased PlioCore precipitation in regions that have become deglaciated and are therefore substantially warmer. Latitudes associated with the 355 westerly wind belts also show enhanced PlioCore precipitation, with an indication of a poleward shift in higher latitude precipitation. This result is consistent with findings from PlioMIP1 (Li et al. 2015). Other, more locally defined ΔPrecip appears closely linked to localised variations in Pliocene topography and land/sea mask changes, for example, the Sahul and Sunda Shelf that become subaerial in the PlioCore experiment. In general, the models that display the largest SAT sensitivity to the prescription of Pliocene boundary conditions also display the largest ΔPrecip (CESM2, CCSM4-Utr, EC-Earth3.3). This 360 is consistent with a warmer atmosphere leading to a greater moisture carrying capacity and therefore greater evaporation and precipitation. The model showing the least sensitivity in terms of precipitation response is GISS2.1G.
Analysis of the standard deviation within the ensemble demonstrates that, in contrast to SAT, models are most consistent regarding ΔPrecip in the extratropics (Fig. 5c). This is similar to the findings of PlioMIP1 (Haywood et al., 2013a) and is likely because more precipitation falls in the tropics rather than extratropics, and therefore the inter-model differences are 365 larger in the tropics. The methodology of Mba et al. (2018) and Nikulin et al. (2018) (described in section 3.1), was used to determine the robustness of ΔPrecip (Fig. 5d). Unlike the temperature signal, which was robust throughout most of the globe, there are large regions in the tropics and subtropics where the ensemble precipitation signal is uncertain. Changes in precipitation rates in the subtropics have some inter-model coherence in many places because at least 80% of models agree on the sign of change. However, most of these predicted changes are not robust because the magnitude of change is not large 370 compared to the standard deviation seen in the ensemble (Fig. 5c). This is consistent with results from CMIP5, which show predicted precipitation changes have low confidence particularly in the low and medium emissions scenarios (IPCC, 2013).
The signal of precipitation change is determined to be robust in the high latitudes and in the mid-latitudes in regions influenced by the westerlies. This is also the case in regions influenced by the West African, Indian and East Asian Summer Monsoons (Fig. 5d). Supplementary Fig. 5 shows the difference between each model's ΔPrecip and the multi-model mean ΔPrecip 375 (shown in Fig. 5b), highlighting that there is uncertainty in the ensemble with respect to the regional patterns of precipitation change. Figure 6a shows the seasonal cycle of the precipitation anomaly averaged over the Northern Hemisphere. As was the case for 380 SAT, the monthly and seasonal distribution of precipitation anomalies are highly model dependent, although the ensemble average shows a clear NH late spring to autumn PlioCore enhancement in precipitation (Fig. 6a). This is most strongly evident in the models CESM2, EC-Earth3.3 and CCSM4-Utr, however it is also evident in other models. Some models show a different seasonal cycle to the ensemble mean, for example the GISS2.1G model simulates the NH late spring to autumn ΔPrecip being supressed compared to the rest of the year, and HadCM3 which has a bimodal distribution. An increase in NH summer 385 precipitation is consistent with a general trend of West African, Indian and East Asian Summer monsoon enhancement, and this will be discussed in detail in a forthcoming PlioMIP2 paper. In PlioMIP1 (ensemble average -dotted black line and model rangeshaded grey area in Fig. 6a) the seasonal cycle in precipitation was much more muted. PlioMIP1 results in the boreal winter are similar to PlioMIP2, however the mean precipitation anomaly in PlioMIP2 between June and November is 40% larger than PlioMIP1. This increase is mainly due to the inclusion of new and more sensitive models into PlioMIP2 (e.g. 390 CESM2 and EC-Earth3.3). However, some models with enhanced summer precipitation (e.g. COSMOS) contributed to both PlioMIP1 and PlioMIP2 suggesting a role of boundary condition changes in enhancing the NH boreal summer precipitation.

Seasonal cycle of total precipitation and land/sea precipitation contrasts
It is noted, however, that not all the new models in PlioMIP2 show enhanced summer precipitation relative to PlioMIP1. The GISS2.1G model, which was new to PlioMIP2, shows the most muted summer precipitation response in the NH in all PlioMIP2 and PlioMIP1 models. This means that the range of summer/autumn NH precipitation responses as shown by the ensemble 395 increases significantly in PlioMIP2. For example, PlioMIP1 showed a NH precipitation response in October to be 0.13-0.42 mm/day while in PlioMIP2 this has increased to 0.05-0.70 mm/day.
In terms of the land/sea ΔPrecip contrast the PlioMIP2 ensemble divides into two groups (Fig. 6b)

14
This section will consider the relationship between ECS and ESS across the ensemble. Table 2 shows the ECS for each model (referenced in Table 1) and the ESS estimated from the PlioCore -PICtrl temperature anomaly (equation 1). Since ice sheet changes were prescribed, there will be no transient response due to ice sheet changes, and the PlioCore experiment will be in equilibrium with the ice sheets. The mean ESS / ECS ratio is 1.67, suggesting that the ESS based on the ensemble is 67% larger than the ECS, however the range is large with the GISS2.1G model suggesting that the ESS / ECS ratio is 1.22 while 415 the CCSM4_Utr model suggests that the ESS / ECS ratio is 2.85 The first analysis of how ECS relates to ESS will consider the correlation between ECS and the globally averaged PlioCore -PICtrl temperature anomaly. This is seen in Fig. 7a, and each cross represent the results from a different model in the PlioMIP2 ensemble. There is a significant relationship between ECS and the PlioCore -PICtrl temperature anomaly at the 95% confidence level (p=0.01, R 2 =0.35) with the line of best fit: ECS = 2.3 + (0.44 × (PlioCore(SAT) -PICtrl(SAT))). 420 Next, we investigate whether there is a correlation across the ensemble between ECS and the PlioCore -PICtrl SAT anomaly on spatial scales. In the analysis that follows we will simply assess whether such a correlation exists and if so, how strong it is, by looking at p-values and R-squared values, calculated from the models in the ensemble. Fig. 7b shows the relationship (pvalueblue, R-squared -red) across the ensemble between modelled ECS and the modelled zonal mean PlioCore -PICtrl SAT anomaly. We find a significant relationship (p < 0.05) between ECS and the zonal mean Pliocene temperature anomaly 425 throughout most of the tropics. This relationship becomes significant at the 99% confidence level (p < 0.01) between 38°N and 27°S, where a high proportion of the inter-model variability in global ECS can be related to the inter-model variability in the Pliocene SAT anomaly at an individual latitude, reaching a maximum of 65% at ~15°N.
Next the relationship between global ECS and the local PlioCore -PICtrl SAT anomaly is assessed. In Fig. 7c colours show the R-squared correlation across the ensemble between modelled global ECS and modelled local PlioCore -PICtrl SAT anomaly. 430 The regions where the relationship between the two is significant at the 95% confidence level is hatched. The relationship between ECS and the local PlioCore -PICtrl SAT anomaly is significant over most of the tropics, and over some mid and high latitude regions including Greenland and parts of Antarctica. In many cases, the tropical oceans show a temperature anomaly more strongly related to ECS than the land, although this is not always the case.  Haywood et al (2013a, b) proposed that the proxy data/climate model comparison in PlioMIP1 could include discrepancies owing to the comparison between time averaged PRISM3D SST and SAT data, and climate model representations of a single time slice. In order to improve the integrity of the data/model comparisons in PlioMIP2, Foley and Dowsett (2019) synthesised alkenone SST data that can be confidently attributed to the MIS KM5c time slice that experiment PlioCore is designed to 440 represent. Foley and Dowsett (2019) provide two different SST data sets. One data set includes all SST data for an interval of 10,000 years around the time slice (5,000 years to either side of the peak of MIS KM5c) and the other covers 30,000 years 15 (up to 15,000 years to either side of the peak; this latter dataset will hereafter be referred to as F&D19_30). Age models used in the compilation are those originally released with the data sets, but later modifications of age models or the integration of additional data could result in mean SST values different from those reported in F&D19_30. All SST estimates are calibrated 445 using Müller et al. (1998). Prescott et al. (2014) demonstrated that due to the specific nature of orbital forcing 20,000 years before and after the peak of MIS KM5c, age and site correlation uncertainty within that interval would be unlikely to introduce significant errors into SST-based DMC. Given this, and in order to maximise the number of ocean sites where SST can be derived, we carry out a point-based SST data/model comparison using the F&D19_30 data set.

Data/Model Comparison
We compare the multi-model mean SST anomaly to a proxy SST anomaly created by differencing the F&D19_30 data set 450 from observed pre-industrial SSTs derived for years 1870-1899 of the NOAA ERSST version 5 data set (Huang et al., 2017;Fig. 8a and Fig. 8b). Fig. 8c shows the proxy data ΔSST minus the multi-model mean ΔSST. Using the multi-model mean results, 17 of the 37 sites show a difference in model/data ΔSST of no greater than +/-1°C (Fig. 8c). These are located mostly in the tropics, but also include sites in the North Atlantic, along the coastal regions of California and New Zealand and in the North Pacific. In terms of discrepancies, the clearest and most consistent signal comes from the Comparing model predicted and proxy based absolute SST estimate for the MIS KM5c time slice (Fig. 8d) yields a similar outcome to the comparison of SST anomalies (Fig. 8c). However, the Benguela region shows greater model-data agreement when considering absolute SSTs than when considering anomalies. Furthermore, a somewhat clearer picture emerges of the model ensemble not producing SSTs that are warm enough in the higher latitudes of the North Atlantic, in the position of the 465 modern North Atlantic gyre, and especially in the Nordic Sea. Although this appears site dependant as the ensemble overestimates absolute SSTs near Scandinavia.
The proxy data ΔSST minus the mean ΔSST for individual models is shown in Supplementary Fig. 6. In regions where there was a strong discrepancy between the proxy data ΔSST and the multi-model mean ΔSST none of the individual models show good model-data agreement. The EC-Earth3.3 model shows an improved agreement with the data in the Mediterranean, the 470 Benguela upwelling system, the site along the East Coast of North America and the site to the West of Svalbard. However, this improved model-data agreement is at the expense of reduced model-data agreement elsewhere: many of the low and midlatitude sites, which had good model-data agreement for the multi-model mean have reduced model-data agreement in EC-Earth3.3. Other models, which showed large warming in PlioMIP2 (i.e. CESM2 and CCSM4-Utr) also show a larger ΔSST than the data for some of the tropical and mid-latitude sites which were in good agreement with the multi-model mean. Models 475 16 that were less sensitive to Pliocene boundary conditions (i.e. GISS2.1G and NorESM-L) do not predict the amount of warming seen in the data for some of the North Atlantic sites and the multi-model means performs better. Table 3 shows statistics for the data-model comparison for both individual models and the multi-model mean. The root mean square error (RMSE) between the model and the data is 3.72 for the multi-model mean, but is lower in some individual models (namely CESM2, IPSLCM6A, EC-Earth3.3, CESM1.2, CCSM4-UoT). In general, those models that have a lower model-data RMSE are those 480 which have higher ECS and a higher PlioCore -PICtrl warming, while less sensitive models have a higher model-data RMSE.
The average difference between the data and model across all the data points shows a similar pattern. The proxy data is on average 1.5°C warmer than the multi-model mean. However, some individual models have a much smaller average modeldata discrepancy (e.g. CESM2 = -0.18°C). The models with a lower model-data discrepancy are those which also have a lower model-data RMSE and have higher than average PlioCore -PICtl warming. 485 This initial analysis suggests that the most sensitive models agree better with the proxy data than the less sensitive models.
However further analysis does not fully support this result. If we consider how many of the 37 sites have 'good' model-data agreement a different picture emerges. Table 3  suggests that proxy-based SST anomaly estimates can be used to infer global mean SAT anomalies, provided that enough SST 500 proxy data is available to reliably estimate SST anomalies. The multi-model median ratio of ΔSAT / ΔSST is 1.4, while the multi-model median ratio of ΔSST to ΔSST (60°N-60°S) is 1.5.

Large-scale features of a warmer climate (palaeo vs future, older vs younger models) 505
The range in the global annual mean ΔSAT shown by the PlioMIP2 ensemble (from 1.7 to 5.2°C) is akin to the best estimate (and uncertainty bounds) of predicted global temperature change by 2100CE using the RCP4.5 to 8.5 scenarios (RCP4.5 = 1.8 ± 0.5 °C and RCP8.5 = 3.7 ± 0.7 °C IPCC, 2013; Table 12.2). Comparing the degree of Pliocene temperature change to predicted changes at 2300 CE, the multi-model mean SAT change is between RCP4.5 (2.5 +/-0.6 °C) and RCP6.0 (4.2 +/-1.0

°C). 510
Studies have suggested that the Arctic temperature response to a doubling of atmospheric CO2 concentration may be 1-3 times that of the global annual mean temperature response (Hind et al., 2016). All 16 models within the PlioMIP2 ensemble simulate a polar amplification factor (PA; averaged over the NH and SH) between 2 and 3 (meaning that the high latitude temperature increase is 2-3 times the global mean temperature increase) however, 2 models (GISS2.1G and NorESM-L) show PA > 3 in the SH. An important caveat to note in the comparison between Pliocene and future predicted polar amplification factors is 515 the major changes in the size of the ice sheets, which in terms of area-of-ice difference affect the SH far more than the NH.
Both model simulations and observations (Byrne and O'Gorman, 2013) show that as temperatures rise, the land warms more than the oceans. This is due to differential lapse rates linked to moisture availability on land. From a theoretical standpoint the difference in land/sea warming is expected to be monotonic with increases in temperature. However, in reality the rise is non-monotonic and is regulated by latitudinal and regional variations in the availability of soil moisture that influences lapse 520 rates (Byrne and O'Gorman, 2013). This is evident in the PlioMIP2 ensemble with land/sea amplification of warming noted more strongly in the global mean than in the tropics where precipitation is most abundant (Fig. 3b). For perturbations to the pre-industrial, modelling and observational studies have shown that land warms 30 to 70% more than the oceans (Lambert and Webb, 2011). The PlioMIP2 ensemble broadly supports this conclusion and previous work. It also supports studies that have indicated that the land/sea warming contrast is not dependent upon whether we are considering a transient (RCP-like) or an 525 equilibrium-type climate change scenario (e.g. Lambert & Webb, 2011).
In predictions of future climate change, a consistent result from models is that the warming signal is amplified in the Northern compared to Southern Hemisphere in the extratropics. There have been several studies which have proposed mechanisms to explain this, including heat uptake by the Southern Ocean (Stouffer et al., 1989) as well as ocean heat transport mechanisms (Russell et al., 2006). Within the PlioMIP2 ensemble, 11 out of 16 models show a larger temperature change in the NH 530 extratropics than the SH extratropics (Fig. 3c). This can in part be explained by the area of land in the NH being larger than in the SH and the already discussed amplification of warming over the land versus the oceans. However, the degree of difference is highly model dependent and not as large as has been reported for simulation of future climate change by the IPCC (IPCC, 2013). This may be linked to the intrinsic difference in response between a RCP-like transient and equilibrium climate experiment, and in the Pliocene substantially reduced ice sheets on Antarctica, which are not specified in future climate change 535 simulations. Hence, the noted hemispheric difference in warming for the future may simply be a transient feature that would not be sustained as the ice sheets on Antarctica responded to the warming over centennial to millennial timescales.
The 3.2°C increase in multi-model mean temperature is associated with a 7% increase in global annual mean precipitation.
According to the Clausius-Clapeyron equation, the water holding capacity of the atmosphere increases by about 7% for each 1°C of temperature increase. The increase in precipitation is therefore less than would be expected if it were assumed that all 540 aspects of the hydrological cycle remained the same as pre-industrial. This is in line with model simulations of future climate change linked to greater temperatures enhancing evaporation from the surface and the atmosphere having a greater moisture carrying capacity, but sluggish moist convection (Held and Soden, 2006).
A particularly robust feature across the ensemble is an increase in precipitation over the modern Sahara Desert and over the Asian monsoon region (Figure 5d). These regions also experience enhanced precipitation under the RCP8.5 scenario for 2100 545 (IPCC, 2013; Figure SPM.7). However, in other tropical and subtropical regions the PlioCore model response is small compared to the pre-industrial inter-model standard deviation. Corvec and Fletcher (2017) showed that in PlioMIP1 studies the tropical overturning circulations in the mPWP were weaker than pre-industrial simulations, while Sun et al. (2013) showed that both Hadley Cells expanded polewards, a result consistent with (but weaker than) the RCP4.5 scenario. These changes in circulation are consistent with the expansion of the subtropical 550 highs and the corresponding reduction in subtropical oceanic precipitation seen in Figure  It has been seen that some of the main differences between the PlioMIP1 and PlioMIP2 ensembles are due to the inclusion of new models in PlioMIP2 that were not available at the time of PlioMIP1. This suggests that recent developments in model 560 physics lead to an altered responses to Pliocene boundary conditions. For example, PlioMIP2 includes three IPSL models: IPSLCM5A (ΔSAT = 2.3°C), IPSLCM5A2 (ΔSAT = 2.2°C) and IPSLCM6A (ΔSAT = 3.4°C) while only IPSLCM5A contributed to PlioMIP1. PlioMIP2 includes three models from NCAR: CCSM4 (ΔSAT = 2.6°C), CESM1.2 (ΔSAT = 4.0°C) and CESM2 (ΔSAT = 5.2°C) while only CCSM4 contributed to PlioMIP1. Within both these model families, the newer versions provide a greater response to the Pliocene boundary conditions and also have a higher published climate sensitivity. 565 Across the ensemble there is a significant correlation between sensitivity and model resolution (supplementary Fig. 8), with a larger temperature anomaly and precipitation anomaly predicted in higher resolution models (p < 0.05). This suggests that low resolution models may not be able to capture the full extent of climate change shown by higher resolution models.
However, it is noted that these relationships are only statistical correlations and some models do not show the same pattern.
For example, the CCSM4-Utr model has much greater temperature and precipitation anomalies, and CCSM4 has lower 570 temperature and precipitation anomalies than other models of a similar resolution.

Model representations of Pliocene climate vis-a-vis proxy data
One of the most fundamental changes in experimental design between PlioMIP2/PRISM4 and PlioMIP1/PRISM3D was the approach towards geological data synthesis for data/model comparison, in particular, moving from SST and vegetation 575 estimates for a broad time slab to a short SST time series encompassing the MIS KM5c timeslice. This was necessary in order to assess to what degree climate variability within the Pliocene could affect the outcomes of data/model comparison and, fundamentally, to derive greater confidence in the outcomes which could be derived from Pliocene data/model comparison (Haywood et al., 2013a, b). In addition, PlioMIP2 contains many new models not used in PlioMIP1, and the PlioMIP2 boundary conditions have changed compared to PlioMIP1 (particularly the Land-Sea Mask, and the topography). 580 Nevertheless, what emerges from the comparison of the PlioMIP2 SST ensemble to the F&D19_30 SST data set is a nuanced picture of widespread model/data agreement with specific areas of concern.
Data model comparisons undertaken for PlioMIP1 indicated that the PlioMIP1 ensemble overestimated the amount of SST change as a zonal mean in the tropics (Dowsett et al., 2012;Fedorov et al., 2013). In PlioMIP2 point-based comparisons, there is little indication of a systematic mismatch between the data and the models. Models and proxy data appear to be broadly 585 consistent in the tropics. The F&D19_30 data set is comprised of alkenone-based SSTs. In contrast, the PRISM3D data set used for DMC in PlioMIP1 was time averaged and composed of estimates from a combination of faunal analysis, Mg/Ca and alkenone-based SSTs. Tierney et al. (2019) demonstrated that the PlioMIP1 ensemble compared well to alkenone-based SST estimates in the tropical Pacific for the whole mid-Pliocene Warm Period, not just the PlioMIP2 time slice, when the alkenonebased temperatures were recalculated using the BAYSPLINE calibration. Therefore, the choice of proxy and inter-proxy 590 calibration alone can be enough to alter the interpretation of the extent to which the model and data agree. In addition, comparing the PlioMIP2 results to an additional dataset of published SSTs for the timeslice (McClymont et al., this issue), we see that the first order outcome of model-data comparison is the same as that shown by the comparison to F&D19_30.
The Pliocene minus pre-industrial SST anomaly will not only depend on which SST dataset is chosen to represent the Pliocene, but also on the choice of observed SST data set used for the pre-industrial. Supplementary Fig. 9 shows the proxy data 595 reconstructed SST change using the F&D19_30 data set but using two different observed data sets for pre-industrial SSTs to create the required proxy data SST anomaly. Using recently released NOAA ERSST V5 data set (Huang et al., 2017) to create the anomaly instead of the older HadISST data (Rayner et al., 2003) leads to three sites in the North Atlantic showing a muchreduced Pliocene warming. It also means that several sites in the tropics now show a small (2 to 3°C) warming during the Pliocene, while using HadISST data led to an absence of SST warming at these locations. The difference between using NOAA 600 ERSST V5 or HadISST is sufficiently large that it can determine whether the PlioMIP2 ensemble is able to largely match (or mismatch) the proxy-reconstructed temperatures.

20
Another region of data/model mismatch noted in PlioMIP1 was the North Atlantic Ocean (NA). Haywood et al. (2013a) noted a difference in the model-predicted (multi-model mean) versus proxy reconstructed (PRISM3D) warming signal of between 2 to 7°C in the NA. The PlioMIP2 multi-model mean SST anomaly appears to be broadly consistent with the F&D19_30 data 605 set, with a SST anomaly at two sites matching to within 1°C and the other to within 3°C (Fig. 8). There are several possible ways to account for this apparent improvement. Firstly, the total number of sites in the NA in F&D19_30 is reduced compared to the PRISM3D SST data set . The site that led to the 7°C difference noted in Haywood et al. (2013a) is not present in the F&D19_30 data set. Secondly, the PlioMIP2 experimental design specified both the Canadian Archipelago and Bering Strait as closed. Otto-Bliesner et al. (2017) performed a series of sensitivity tests based on the NCAR CCSM4 610 PlioMIP1 experiment and found the closure of these Arctic gateways strengthened the AMOC by inhibiting transport of less saline waters from the Pacific to the Arctic Ocean and from the Arctic Ocean to the Labrador Sea, leading to warming of NA SSTs. Dowsett et al. (2019) also demonstrated an improved consistency between the proxy-based SST changes and modelpredicted SST changes after closing these Arctic gateways in models. It is therefore likely that the multi-model mean SST change in the NA in PlioMIP2 has been influenced by the specified change in Arctic gateways leading to a regionally enhanced 615 fit with proxy data. However, the question regarding the veracity of the specified changes in Arctic gateways in the PRISM4 reconstruction, given the uncertain and lack of geological evidence either way, remains open and requires further study.
One of the clearest data/model inconsistencies occurs in the Benguela upwelling system, where proxy data indicates more SST warming than the multi-model mean. The simulation of upwelling systems is particularly challenging for global numerical climate models due to the spatial scale of the physical processes involved, and the capability of models to represent changes 620 in the structure of the water column (thermocline depth) as well as cloud/surface temperature feedbacks. Dowsett et al. (2013) noted SST discrepancies between the PRISM3D SST reconstruction and the PlioMIP1 ensemble. Their analysis of the seasonal vertical temperature profiles from PlioMIP1 for the Peru Upwelling region indicated that models produced a simple temperature offset between PI and the Pliocene but did not simulate any change to thermocline depth.
An assumption that proxy-data truly reflect mean annual SSTs in upwelling regions is also worthy of consideration. In 625 upwelling zones, nutrients (and relatively cold waters) are brought to the surface increasing productivity. The upwelling of nutrient rich waters is often seasonally modulated, which could conceivably bias alkenone-based SSTs to the seasonal maximum for nutrient supply and therefore coccolithophore productivity and/or alkenone flux. In the modern ocean, across the most intense region of Benguela upwelling, the productivity seems to be year-round, whereas the southern Benguela has highest productivity during the summer (Rosell-Melé and Prahl, 2013). Ismail et al. (2015), based on observational data, 630 demonstrated that it was surface heating, not vertical mixing related to upwelling, which controls the upper ocean temperature gradient in the region today. This lends some credence to the idea that the observed mismatch between PlioMIP2 ΔSST and the F&D19_30 proxy-based anomaly could arise from the complexities/uncertainties associated with interpreting alkenonebased SSTs in the region as simply an indication of mean annual SST (Leduc et al. 2014). However, we note that no seasonal bias has been identified in the modern dataset in the Benguela region (Tierney and Tingley, 2018). 635 21

Equilibrium Climate Sensitivity, Earth System Sensitivity and Pliocene climate
From the analysis shown in section 3.6 a strong relationship between ECS and the ensemble-simulated Pliocene temperature anomaly is discernible. This point is true for the globally averaged temperature anomaly, latitudinal average temperature anomalies in the tropics and specific gridbox based temperature anomalies over large portions of the globe. Across the 640 ensemble, the tropical Pliocene temperature anomaly is more strongly related to ECS than other latitudes, both as a latitudinal mean and also when considering individual gridpoints. On a gridpoint by gridpoint basis, the tropical oceans are strongly related to modelled ECS, suggesting that SST data from the Pliocene tropics has the potential to constrain model estimates of ECS, highlighting the benefits for deriving estimates of ECS from a concentrated effort to reconstruct tropical SST response using the geological record. 645 For PlioMIP1, Hargreaves and Annan (2016) also found that modelled PlioCore -PICrtl SST anomalies over the tropics (30°N-30°S) were correlated with modelled ECS, according to: 650 Where α and C are constants, and ε represents all errors in the regression equation. They then used equation (2) along with tropical SST data from PRISM3D (an interpolated dataset of Pliocene proxy SST) to provide a Pliocene data constrained estimate of ECS of 1.9°C -3.7°C. In order to constrain ECS from the data and modelling used in PlioMIP2, we slightly amend the Hargreaves and Annan (2016) methodology because PlioMIP2 proxy data is more sparsely distributed than PlioMIP1 proxy data and we cannot obtain a reliable estimate of tropical average SST from the data available. To estimate ECS for 655 PlioMIP2 we instead rely on point-based observations (Fig. 8a) and local regressions between PlioCore -PICrtl SST and modelled ECS (Figure 7c). Hence, we apply equation (2) with ΔSST from individual data sites, and α and C will now be location dependent. Using this altered methodology, a different estimate of ECS is obtained for each datapoint, these estimates are shown in Fig. 9, and have a range of 2.6°C -4.8°C with a mean ECS of 3.6°C and a standard deviation of 0.6°C. Fig. 9 does not imply that ECS is different for each location, instead each value in Fig. 9 is an estimate of ECS and incorporates the 660 true Pliocene constrained ECS along with several errors. For a data-point to be included in Fig. 9 we required that two conditions were met. Firstly, we required that the relationship between local PlioCore -PICrtl and a model's ECS was significant at the 95% confidence level (p < 0.05; these regions are hatched in Fig. 7d). Secondly, we required that at least one of the models in the PlioMIP2 ensemble was within 1°C of the data; this second condition meant that we excluded two sites off the Eastern United States, two sites from the Mediterranean, and two site from Bengueladespite these sites showing a theoretical 665 relationship between PlioCore -PICrtl and ECS. Altogether 13 datapoints fulfilled both these conditions and could be used to estimate ECS. The range of estimates of ECS from PlioMIP2 (2.6°C-4.8°C) are similar to IPCC (1.5°C -4.5°C) but are slightly larger than was estimated from PlioMIP1 (1.9°C -3.7°C). It is not currently possible to add reliable error bars to the range of ECS estimates from PlioMIP2. However, as the Tier1 PlioMIP2 experiments with CO2 set to 350ppmv and 450ppmv become available we will be able to provide an indication as to how uncertainties in the KM5c CO2 would affect the PlioMIP2/PRISM4 670 constrained estimates of ECS. In addition, as more orbitally tuned SST data becomes available, it will be necessary to revisit the ECS analysis in order to ensure maximum accuracy.
The emergence of the concept of longer-term sensitivity, ESS, can be at least partly attributed to the study of the Pliocene epoch Haywood et al., 2013a). However, as Hunter et al. (2019) state clearly, the comparison between ECS, PlioCore -PICrtl, and ESS can only be robust if an assumption is made that the PlioMIP2 model boundary conditions are a good 675 approximation to the equilibrated Earth system with enhanced atmospheric CO2 concentration. This may appear to be a reasonable assumption now, since the changes in non-glacial elements of the PRISM4 palaeogeography are limited. Yet, within the bounds of plausible uncertainty, a larger number of additional palaeogeographic modifications remain possible for the Late Pliocene than were incorporated into the PRISM4 reconstruction (see Hill 2015 andDe Schepper et al., 2015), and which may have a bearing on how well the Pliocene is seen to approximate an equilibrated modern Earth system in the years 680 ahead.
PlioMIP1 determined a range in the ESS/CS ratio of between 1.1 and 2.0, with a best estimate of 1.5. In PlioMIP2, which has benefited from the access to a larger array of models and new boundary conditions, the range in and best estimate for the ESS/CS ratio is similar but slightly larger (1.1 to 2.9 and 1.7 respectively). Therefore, new modelling and new constraints on the data for PlioMIP2 suggests a slight increase in estimates of both the ESS/CS ratio and data constrained estimates of ECS 685 between PlioMIP1 and PlioMIP2.

Conclusions
The Pliocene Model Intercomparison Project Phase 2 represents one of the largest ensembles of climate models of different complexities and spatial resolution ever assembled to study a specific interval in Earth history. PlioMIP2 builds on the findings 690 of PlioMIP1 and incorporates state-of-the-art reconstructions of Pliocene boundary conditions and new temporally consistent sea-surface temperature proxy data which underpins the new data/model comparison. The major findings of the work include: • Global annual mean surface air temperatures increase by 1.7 to 5.2°C compared to the pre-industrial, with a multi-model average increase of 3.2°C.
• The mean annual surface air temperature response is larger in PlioMIP2 than in PlioMIP1, mainly due to the addition of 695 new and more sensitive models in PlioMIP2.
• The multi-model mean annual total precipitation rate increases by 7% compared to the pre-industrial, while the modelled range of precipitation increases by between 2% and 13%.
• The multi-model mean anomaly between Pliocene and pre-industrial is statistically robust for surface air temperature and sea surface temperature over most of the globe. The multi-model mean precipitation anomaly is robust at mid-high latitudes 700 and in monsoon regions but is smaller than inter-model standard deviation in many parts of the tropics and subtropics.
• The degree of polar amplification of surface air temperature change is generally consistent with RCP transient climate modelling experiments used to predict future climate, implying that CO2 changes dominate the ice sheet changes in the PlioCore experiments.
• The land warms more than the oceans in a manner akin to future climate change simulations. 705 • As an ensemble, average NH warming does not show a clear seasonal cycle, but a clear seasonal cycle is seen in many individual models.
• The difference in the average warming between the hemispheres is subdued, relative to simulations of 2100 CE climate. This is likely due to the substantial changes to the albedo feedback mechanism in the Southern Hemisphere following the removal of large areas of the Antarctic ice sheet in the mid-Pliocene. 710 • There is a statistically significant relationship between ECS and Pliocene global annual average temperature change. The PlioMIP2 ensemble finds that ESS is greater than ECS by a best estimate of 67%.
• Model estimates of the relationship between ECS and PlioCore -PICrtl, combined with the PlioMIP2 ΔSST, provides a data constrained estimate of ECS with a range of 2.6°C-4.8°C. This is larger than the values suggested from PlioMIP1 (1.9°C -3.7°C). 715 • The PlioMIP2 model ensemble shows broad agreement on polar amplification of the global warming signal and tropical enhancement of rainfall anomalies. Inter-model differences in simulated temperature are mostly found in polar regions and where land-sea-mask and orography of Pliocene paleogeography differ from today.
• The PlioMIP2 ensemble appears to be broadly reconcilable with new temporally specific records of sea surface temperatures. Significant agreement between simulated and reconstructed temperature change is seen, with notable local 720 signals of data/model disagreement. Differences between observed pre-industrial sea surface temperature data sets are large enough to have a significant impact on how well models reproduce proxy-reconstructed ocean temperature changes.

Acknowledgments
We acknowledge the use of NOAA_ERSST_V5 data provided by the NOAA/OAR/ESRL PSD, Boulder, Colorado, USA, 725 from their web site at https://www.esrl.noaa.gov/psd/. AMH, JCT, AMD, SJH and DJH, acknowledge the FP7 Ideas:        Multimodel mean Plio Core -PI Ctrl precipitation anomaly. c) Standard deviation across the models for the Plio Core -PI Ctrl Precipitation anomaly. d) Plio Core -PI Ctrl precipitation anomaly, regions which have at least 80% of the models agreeing on the sign of the change are marked with '/'. Regions where the ratio of the multimodel mean precipitation change to the PI Ctrl intermodel standard devition is greater than 1 are marked with '\'.   Figure 7: a) the globally averaged Plio Core -PI Ctrl SAT anomaly for each model vs the published equilibrium climate sensitivity (crosses) with the line of best fit shown in orange. b) statistical relationships between the latitudinally averaged Plio Core -PI Ctrl SAT anomaly and the published climate sensitivity. The proportion of climate sensitivity that can be explained by the SAT anomaly at each latitude (Rsq) is shown in red, while the probability that there is no correlation between the climate sensitivity and the SAT anomaly (p) is shown in blue. c) colors: the percentage variation in modelled ECS that is linearly related to the modelled Plio Core -PI Ctrl SAT anomaly at each gridsquare (R sq ). Hatching shows a significant relationship (at the 5% confidence level) between the Plio Core -PI Ctrl SAT anomaly at that gridsquare and ECS.  Figure 8: a) PRISM4 -NOAA ERSSTv5 SST anomaly for the datapoints described in section 4. b) multimodel mean Plio Core -PI Ctrl SST anomaly at the points where data are available. c) The difference between the SST anomaly derived from the data (Fig. 8a) and that of the multimodel mean (Fig. 8b) .75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75 climate sensitivity (deg C) Figure 9: Equlibrium Climate Sensitivity estimated from the data shown in Fig. 8 based on the modelled gridpoint based regressions between Plio Core -PI Ctrl and ECS. This analysis was limited to those data points which were with 1 • C of the range of PlioMIP2 models, and was also limited to those sites which were in a gridbox where the modelling suggested a significant relationship between the Plio Core -PI Ctrl SAT anomaly and the ECS.