“High-quality health data research and the development of ML [machine learning] models requires meaningful data at a sufficient scale. Such data undoubtedly exist. Most health institutions hold clinical imaging data at a scale ranging from tens of thousands to tens of millions of scans. However, these data are often inaccessible to researchers, even where there is an intention to make them available for research, because of barriers of access and usability. Barriers of access can include: governance barriers (difficulties in understanding and working through governance frameworks regulating data usage); cost barriers (there can be considerable overhead costs to datasets and many datasets require payment for access); and time barriers (dataset requests and curation might incur a considerable time lag before they can be made available). Barriers of usability include: data format barriers (the data might not be in a computationally tractable form); data quality barriers (the data might be of insufficient or uncertain quality); and image labelling barriers (most imaging projects depend on the accurate labelling of those images, which might not be undertaken as part of routine care and are difficult to do retrospectively). To bypass these barriers, many research groups resort to using publicly available imaging datasets. This alternative route often leads to the same datasets within a clinical area being used by many research groups.
[..] This Review aims to identify all publicly available ophthalmological imaging datasets, to create a central directory of what is available for access currently. We report the source of each dataset, their accessibility, and a summary of the populations, diseases, and imaging types represented.
[..] Of the 140 unique datasets, only 94 were open access from which the raw data could be downloaded. 27 datasets were categorised as open access with barriers, from which data could not be downloaded. 19 datasets had regulated access (12 requiring licensing agreements, six requiring an ethical committee or institutional approval, and one requiring a payment of £2250 plus value-added tax).
[..] Of the 94 open access datasets, we found 25 to be from within Asia (four from south Asia and 21 from southeast Asia or east Asia), nine from North Africa and the Middle East, 34 from Europe, 16 from North America, two from South America, and one from sub-Saharan Africa. [..] The country of origin was unknown in 13 datasets. Dataset inception was reported by 47 datasets and ranged from 2003 to 2019.
[..] Where reported, the most common reason for image acquisition was for a research study or a clinical trial (54 of 94; 57%), and for routine clinical care or screening (23 of 94; 24%). Five of 94 (5%) datasets included images acquired from primary care (including screening programmes), 45 of 94 (48%) were from secondary care (hospital or eye clinics), 18 of 94 (19%) were collected in other settings (such as from a university, research settings, or eye banks), and one of 94 (1%) from a non-health-care setting. The setting was unreported in 25 of 94 (27%) datasets. Only 20 of 94 (21%) datasets gave information on whether patient consent was sought and 26 of 94 (28%) datasets stated details about obtaining ethical approval for obtaining or sharing the images.
[..] Although technical details relating to the image files and their acquisition were well reported, any associated clinical information was not. The following information was consistently reported across all datasets: imaging modality (100%), number of images (100%), image format (100%), country of origin (86%), device name and manufacturer (85%), and ophthalmological disease (82%). Patient characteristics (including age, sex, and ethnicity) were particularly under-reported (these factors were reported in <20% of datasets), with 74% of the datasets not reporting any patient demographic data, even at the aggregate level. The inclusion and exclusion criteria were described for only 15% of the datasets and the data collection period was reported for only 19% of the datasets.
[..] Across all datasets, fundus retinal photography was the most common imaging type (54 of 94 datasets), probably because of its widespread availability and common use across a wide range of ophthalmological diseases. The second most common imaging modality was OCT and OCT angiography (18 of 94 datasets, of which nine contained three-dimensional OCT data). Preservation of the three-dimensional volume data is advantageous as they give contextual information from neighbouring B-scans, allowing ML algorithms to learn key structural information that might enhance their performance.
[..] For datasets with image labels (such as diagnostic or feature labels), the labelling processes were also poorly defined. Many assumptions are made during the labelling of ground truths, and therefore assurances regarding label accuracy are paramount since they carry implications for any ML model trained with the use of these labels. Details about the labellers’ amount of expertise, the consensus process used for multiple labellers, and how discrepancies were resolved are therefore all relevant. In the few datasets that reported this information, labellers ranged from medical students to specialist ophthalmologists, but in most cases the skills of the labellers were unknown. Although the detailed labelling of public datasets might be ambitious, a checklist of minimum reporting metadata items could drastically improve the usefulness of the data and could also potentially enable merging across multiple datasets.
[..] The first implication is accessibility. It is encouraging that our Review identified 94 datasets that were potentially open access, but discoverability appears to be an issue. Although a few datasets are well known, many are not, which might lead to lost research opportunities and might result in bias because of an overuse of a few potentially non-representative datasets. There is value in having an online catalogue of such datasets, which would improve their visibility and provide some key metadata that would enable researchers to identify the most suitable dataset for their research question.
[..] The second important implication is the transparency and reporting of a dataset. The value of a dataset is associated with far more than just its size, and our Review has highlighted many factors that would be key considerations for a user. There are, of course, advantages to scale, for example in the development of deep learning models or when seeking to detect a modest signal in a heterogeneous population, but the usability of the dataset will also be associated with the quality, depth, and representativeness of the data. [..] Without key information about the population and disease, it is impossible to make assumptions on how generalisable the data are for a real-world setting. Previous work outside of the field of health data, such as Datasheets for Datasets (a concept derived from the electronics industry), has highlighted many of the issues raised in this Review, which are prevalent across disciplines. Gebru and colleagues have proposed the reporting of considerations that can improve the transparency and accountability of datasets.
However, there are recognised challenges associated with providing richly labelled data. The curation of metadata items is demanding, costly, and requires careful consideration to ensure accuracy and completeness. The excessive inclusion of detailed metadata could also increase the chance of the reidentification of data items and pose additional privacy concerns. Therefore curation, storage, and access all require thoughtful ethical oversight. However, these risks should be balanced with the potential harm implicated by widespread use of biased and clinically unusable data. Additionally, the risk of reidentification can be mitigated with adherence to widely adopted guidelines for the sharing of raw clinical trial data. The investment of time, skill, and money would generate substantial value in the data and its associated labels; such a dataset is therefore unlikely to be freely available.
The last key implication is around ensuring adequate representation of the population by such datasets. A major concern is the possibility of the underrepresentation of specific groups within public and other datasets, posing unknown biases towards some populations or disease groups. An ML algorithm developed exclusively on one population group might translate poorly beyond that population. If an ML algorithm runs poorly on unseen data that are inadequately described, it is difficult to establish whether the poor performance is attributable to spectrum bias. Knowledge of the populations represented is therefore important for the development of ML algorithms and even more so for their evaluation. This is a key consideration from a global perspective, as countries wishing to develop applications where there is no infrastructure to curate imaging datasets might also be most likely to access publicly available resources as a first option.
[..] For 2015, these three conditions together were estimated to account for 15% of global blindness and 5% of moderate and severe vision impairment, in contrast with the other priority diseases such as cataracts (four datasets), trachoma (one dataset), and refractive errors (three datasets), which contribute to 53% of blindness and 79% of moderate and severe vision impairment. This mismatch might be attributed to many factors, including the relative importance of imaging in the management of the disease, the presence of well developed screening programmes for the most represented diseases (such as diabetic retinopathy) and funding available for specific research areas. Diabetic retinopathy, glaucoma, and age-related macular degeneration are more frequently imaged as part of standard care, as opposed to cataracts, trachoma, and refractive errors. If potential imaging-based solutions could improve the care of patients with cataracts, trachoma, and refractive errors by a non-specialist workforce with the use of task sharing, then perhaps a targeted global effort is required to prioritise the curation and development of imaging in these disease areas.”
Full article, Khan SM, Liu S, Nath S et al. Lancet Digital Health 2020.10.1
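The "checklist of minimum reporting metadata items" mentioned in the excerpt can be pictured as a simple record with one slot per item. The sketch below is only an illustration and is not taken from the article: the class name `DatasetRecord`, the helper functions, and the choice of fields are all assumptions, loosely based on the items the excerpt says were well or poorly reported (modality, image count, format, country, device, disease, demographics, inclusion criteria, collection period, labeller expertise, consent and ethics).

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical minimum-metadata record for one public imaging dataset.

    Field names are illustrative assumptions, not a published standard.
    A value of None means the item was not reported.
    """
    name: str
    imaging_modality: Optional[str] = None
    number_of_images: Optional[int] = None
    image_format: Optional[str] = None
    country_of_origin: Optional[str] = None
    device: Optional[str] = None
    disease: Optional[str] = None
    patient_demographics: Optional[str] = None
    inclusion_exclusion_criteria: Optional[str] = None
    collection_period: Optional[str] = None
    labeller_expertise: Optional[str] = None
    consent_and_ethics: Optional[str] = None


def missing_items(record: DatasetRecord) -> List[str]:
    """Return the names of checklist items that were left unreported."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


def reporting_rate(records: List[DatasetRecord], item: str) -> float:
    """Fraction of datasets in `records` that report the given item."""
    reported = sum(1 for r in records if getattr(r, item) is not None)
    return reported / len(records)


if __name__ == "__main__":
    # Made-up example records, used only to exercise the helpers.
    catalogue = [
        DatasetRecord(
            name="example-fundus-set",
            imaging_modality="fundus photography",
            number_of_images=1000,
            image_format="JPEG",
        ),
        DatasetRecord(name="example-oct-set", imaging_modality="OCT"),
    ]
    for record in catalogue:
        print(record.name, "is missing:", missing_items(record))
    print("image_format reporting rate:", reporting_rate(catalogue, "image_format"))
```

A catalogue built on records like these could report, per item, how many datasets fill it in, which is the kind of summary the Review gives for its 94 open access datasets.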