Aims: To elucidate differences between commonly used mid-latitude geomagnetic indices and to quantify how their responses to solar forcing differ as a function of Universal Time (UT), time-of-year (F), and solar-terrestrial activity level. To identify the strengths, weaknesses and applicability of each index and to investigate ways of correcting any weaknesses without damaging the strengths. Methods: We model how the location of a geomagnetic observatory influences its sensitivity to solar forcing. This single-station model can then be applied to indices that use analytic algorithms to combine data from different stations, from which we derive the response patterns of the indices as a function of UT, F and activity level. The model allows for the effects of solar zenith angle on ionospheric conductivity and of the station’s proximity to the midnight-sector auroral oval. It employs coefficients that are derived iteratively by comparing data from the current aa index stations (Hartland and Canberra) with simultaneous values of the am index, which is constructed from chains of stations in both hemispheres. This is done separately for eight overlapping bands of activity level, as quantified by the am index. Initial estimates were obtained by assuming that the am response is independent of both F and UT; the coefficients so derived were then used to compute a corrected F-UT response pattern for am. This cycle was repeated until the resulting changes in predicted values fell below the adopted uncertainty level (0.001%). The ideal response pattern of an index would be uniform and linear (i.e., independent of both UT and F and the same at all activity levels).
We quantify the response uniformity using the percentage variation at any activity level, V = 100 (σS/〈S〉), where S is the index’s sensitivity at that activity level and σS is the standard deviation of S: both S and σS were computed using the eight UT ranges of the 3-hourly indices and 20 equal-width ranges of F. As an overall metric of index performance, we take an occurrence-weighted mean of V, Vav, over the eight activity-level bins. Ideally this metric would be zero; a large value shows that the index compilation introduces large spurious UT and/or F variations into the data. We also study index performance by comparisons with the SME and SML indices, compiled from a very large number of stations, and with an optimum solar wind “coupling function”, derived from simultaneous interplanetary observations. Results: It is shown that a station’s response patterns depend strongly on the level of geomagnetic activity because at low activity levels the effect of solar zenith angle on ionospheric conductivity dominates over the effect of station proximity to the midnight-sector auroral oval, whereas the converse applies at high activity levels. The metric Vav for the two-station aa index is modelled to be 8.95%, whereas for the multi-station am index it is 0.65%. The ap (and hence Kp) index cannot be analysed directly in this way because its construction employs tabular conversions, but the very low Vav for am allows us to use 〈ap〉/〈am〉 to evaluate the UT-F response patterns for ap. This yields Vav = 11.20% for ap. The same empirical test applied to the classical aa index and the new “homogeneous” aa index, aaH (derived from aa using the station sensitivity model), yields Vav of, respectively, 10.62% (i.e., slightly higher than the modelled value) and 5.54%. The ap index value of Vav is shown to be high because it exaggerates the average semi-annual variation and has an annual variation giving a lower average response in northern hemisphere winter.
It also contains a strong artefact UT variation. We derive an algorithm to correct this uneven response, which gives a corrected ap value, apC, for which Vav is reduced to 1.78%. The unevenness of the ap response arises from the dominance of European stations in the network used and from the fact that all data are referred to a European station (Niemegk). In other contexts, however, this is a strength of ap, because averaging similar data gives increased sensitivity and more accurate values on annual timescales, over which the UT-F response pattern is averaged out.
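
As an illustration of the uniformity metric defined above, the following minimal Python sketch computes V = 100 (σS/〈S〉) on an 8 × 20 UT-F grid for each of eight activity bands and then forms the occurrence-weighted mean Vav. It is not the authors' code: the function names, the synthetic sensitivity pattern, the band occurrence counts, and the use of a population (rather than sample) standard deviation with unweighted bins are all assumptions made purely for illustration.

```python
import numpy as np


def percent_variation(S):
    """V = 100 * sigma_S / <S> for one activity band.

    S holds the index sensitivity on the UT x F grid (8 x 20 in the text).
    Assumption: population standard deviation (ddof=0), unweighted bins.
    """
    S = np.asarray(S, dtype=float)
    return 100.0 * S.std() / S.mean()


def occurrence_weighted_mean(V, occurrence):
    """V_av: mean of the per-band V values, weighted by band occurrence."""
    V = np.asarray(V, dtype=float)
    occurrence = np.asarray(occurrence, dtype=float)
    return float(np.average(V, weights=occurrence))


def synthetic_sensitivity(amplitude, n_ut=8, n_f=20):
    """Hypothetical response pattern: unity plus a UT x semi-annual ripple."""
    ut_phase = 2.0 * np.pi * (np.arange(n_ut) + 0.5) / n_ut
    f = (np.arange(n_f) + 0.5) / n_f  # fraction of year at bin centres
    ripple = np.cos(ut_phase)[:, None] * np.cos(4.0 * np.pi * f)[None, :]
    return 1.0 + amplitude * ripple


# Eight activity bands with made-up ripple amplitudes and occurrence counts.
amplitudes = np.linspace(0.02, 0.16, 8)
occurrence = np.array([30, 25, 18, 12, 8, 4, 2, 1], dtype=float)

V = [percent_variation(synthetic_sensitivity(a)) for a in amplitudes]
V_av = occurrence_weighted_mean(V, occurrence)

for band, v in enumerate(V, start=1):
    print(f"band {band}: V = {v:.2f}%")
print(f"V_av = {V_av:.2f}%")
```

A perfectly uniform response (S constant over the grid) gives V = 0 in every band and hence Vav = 0, which is the ideal case described in the text. Because the weights are occurrence counts, the rare high-activity bands contribute little to Vav even when their individual V values are large.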