Reforming Government. Empowering Patients.
Generalization Uncertainty in AI-Enabled Medical Devices
A Safer Way Forward


Executive Summary
Artificial intelligence (AI) medical devices often perform well in testing but less reliably when used on real patients whose data differs from the data used to develop the model. Device unreliability presents patient safety risks and can erode clinician trust as well as slow the technology’s adoption. From a policy perspective, unreliability may prompt regulatory responses that inadvertently constrain the benefits AI can bring to health care without solving the underlying problem. Thus, “generalization” is a crucial matter for AI-enabled medical devices.
Generalization is the technical term for an AI medical device’s successful processing of real-world data (RWD) encountered in a medical facility. The result of this processing is called the device’s output, and outputs differ according to device type. An output may be a disease prediction, a categorization (such as a diagnosis based on a medical image), a recommendation of a specific medical intervention, or some other clinical determination.
Generalization, however, is not a foregone conclusion for AI devices. Unlike traditional software systems that employ deterministic rules (e.g., “If x is present then do y, otherwise do z”), AI often uses complex models to predict the most likely output. The effectiveness of a model is closely related to the training data that configured the model’s parameters and, in turn, determines the device’s real-world performance. The central policy challenge is how to respond best to this uncertainty without impairing the field’s ongoing progress or mandating an ineffective remedy.
Generalization uncertainty is doubt regarding a device’s ability to produce accurate outputs. Research has found that many devices approved by the Food and Drug Administration lack robust clinical performance evaluations. Moreover, a 2025 study found that AI medical devices lacking validation were more likely to be recalled.1 These validation gaps, compounded by the “black box” nature of AI, heighten concerns about reliability.
As mentioned at the outset, generalization problems pose a genuine threat to patient safety. Alongside this risk of patient harm, generalization problems may also erode confidence in these devices’ reliability and provoke policy reactions that depress AI device adoption. Inasmuch as AI technology is contributing to life-saving innovations—such as predicting when a patient is at high risk for breast cancer in the next 24 months2—generalization uncertainty could prove detrimental to American health care beyond individual device failures.
A worry related to generalization uncertainty is the possibility of algorithmic bias and, potentially, poor device performance for populations that are not well represented within device training data. However, the remedy for generalization uncertainty transcends adequate demographic representation, because AI medical devices are parameterized by specific features contained within their training data. If, for example, the training data is demographically diverse but the characteristics within individual data samples are very similar to one another, then outlier patients (who significantly deviate from these characteristics) may be at higher risk for device non-generalization despite belonging to the demographic segments adequately represented within the training data.
Proposed remedies for generalization uncertainty include third-party device and algorithm certification, training data assessment, and physician evaluation of training data suitability. The limitations of these options include the risk of high consultative costs, a conflict with future adaptive AI designs, and a failure to personalize the generalization question for an individual patient. High consultative costs are a particular concern, because they could encourage a divide between well-financed health systems that can afford such consultation and rural health systems that cannot.
Instead of mandating any of these options, this paper proposes the development of a voluntary alternative, Digital Similarity Analysis (DSA). DSA will evaluate the similarity and dissimilarity of an individual patient’s medical image to the training and testing data used for the device’s development. The purpose of DSA is to determine if a patient’s medical image is an outlier compared to a device’s training and testing data. This determination would be made prior to the use of that device with the patient image. The physician, when alerted by DSA that a patient’s image is significantly dissimilar to the AI device’s training and testing data, can decide to:
- forgo device use because of the perceived risk for generalization problems,
- require supplemental validation of the medical device’s output for the patient, or
- use the device but treat any clinical determination produced by the device with lower confidence.
Although the DSA proposal would not eliminate generalization uncertainty, it provides a valuable direction for AI medical device safety and avoids alternatives that inadequately address the problem. Further, DSA expands the discussion of algorithmic bias beyond broad demographic categories to the specific characteristics of each patient. By shifting evaluation from population groups to individuals, the DSA approach may enhance safety across demographic segments. Finally, an additional benefit of DSA is its integration of image characteristics that arise from differences in radiology equipment and technician technique, a subject too often ignored in the discussion of generalization uncertainty.
Introduction: AI Generalization and Bias Concerns
Artificial intelligence (AI) in health care has expanded at a rapid pace, advancing through a diverse range of applications that include transcribing physician-patient interactions, automating billing code assignments for patient care, and designing new drugs. With respect to AI in medical devices,3 adoption has faced challenges related to their generalization abilities.
Generalization is the technical term for an AI-enabled medical device’s successful processing of real-world data (RWD) encountered in a medical facility. The result produced from this processing is called the device’s output, and outputs differ according to device type. An output may be a disease prediction, a categorization (such as a diagnosis), a recommendation of a specific medical intervention, or some other clinical determination. Generalization, however, is not a foregone conclusion for AI devices. Unlike traditional software systems that use rules to process RWD (e.g., “If x is present then do y, otherwise do z”), AI typically uses complex models4 to predict the most likely output. The effectiveness of any model is closely related to the training data that configured the model’s parameters5 and, in turn, determined the device’s real-world performance. If these parameters are configured too closely to the model’s training data, then the device will likely have generalization deficiencies and the model inside the device will be described as overfit. A generalization deficiency is a technical way of stating that the output produced by an AI-enabled medical device is unreliable, either inaccurate or inadequate. Underfitting, in contrast to overfitting, occurs when the training data is insufficiently expansive6 to address the wide variety of data scenarios expected in real-world use of the device.7 Both overfitting and underfitting can produce generalization deficiencies—that is, the device’s output is either inaccurate or inadequate.
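The contrast between a model that fits too loosely and one that fits its training data too closely can be seen in a toy curve-fitting exercise. The sketch below is not drawn from any medical device: the sine-shaped signal, the noise level, and the polynomial degrees are all assumptions chosen for illustration, and it varies model flexibility (the common textbook framing) rather than dataset breadth. It compares the error on the training points with the error on new points from the same source.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_signal(x):
    # Hypothetical underlying pattern the model is meant to learn.
    return np.sin(2 * np.pi * x)

# A small training set and a larger set of previously unseen points.
x_train = np.linspace(0.0, 1.0, 12)
y_train = true_signal(x_train) + rng.normal(0.0, 0.2, size=x_train.shape)
x_new = np.linspace(0.02, 0.98, 200)
y_new = true_signal(x_new) + rng.normal(0.0, 0.2, size=x_new.shape)

def mean_squared_error(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# results[degree] = (error on training data, error on new data)
results = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (
        mean_squared_error(coeffs, x_train, y_train),
        mean_squared_error(coeffs, x_new, y_new),
    )
    print(f"degree {degree}: train error {results[degree][0]:.4f}, "
          f"new-data error {results[degree][1]:.4f}")
```

In this toy setting the training error can only shrink as the polynomial becomes more flexible, so a low training error alone says little about performance on new data. The straight line (degree 1) is too rigid to follow the pattern at all, while the most flexible fit is free to chase noise in the twelve training points.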
Generalization uncertainty is doubt regarding an AI device’s ability to produce accurate outputs when using RWD. Generalization problems, regardless of overfitting or underfitting, can pose a genuine threat to patient safety. Alongside this risk of patient harm, generalization problems may also erode confidence in these devices’ reliability and provoke policy reactions that depress AI device adoption.8 Because AI technology is contributing to life-saving innovations, such as predicting when a patient is at high risk for breast cancer in the next 24 months,9 generalization uncertainty (and the AI adoption resistance it engenders) could prove detrimental to American health care beyond individual device failures.
Generalization uncertainty is a growing concern in clinical AI,10 particularly given current deficits in device validations. A study of 903 AI-enabled medical devices approved by the Food and Drug Administration (FDA) found that “at the time of regulatory approval, clinical performance studies were reported for approximately half of the analyzed devices, while one-quarter explicitly stated that no such studies had been conducted.”11 A separate study of a similar number of AI medical devices found that 6.3 percent of the devices were subject to recall actions after FDA approval and that devices lacking either prospective or retrospective validation had more recalls per device.12 The combination of validation gaps and recall data, along with the black box nature of AI devices, feeds generalization uncertainty. This state of affairs has given rise to the observation that “despite the promising potential and broad applications, transparent information regarding key characteristics and outcomes related to the clinical evaluation of these devices at the time of regulatory approval is not well established.”13
Generalization uncertainty risks include algorithmic bias14 and, more specifically, poor device performance for minority populations who are not well represented within device training data.15 This worry is part of a larger concern about bias in medicine. Beyond the discussion of medical AI, health care worker bias in treating minority patients has been a matter of long-standing study. Additionally, traditional non-AI software, while not employing training data, can also manifest performance differences related to demographics. For example, a 2019 study found evidence of racial bias within a traditional software algorithm that used predicted health care costs as a proxy for the need for extra patient care and did not account for the fact that “unequal access to care means that we spend less money caring for Black patients than for White patients.”16
The question of AI bias is valid in and of itself, but it is enmeshed within a larger assumption that has its own problems: adequate representation of a population segment within training data produces AI device effectiveness within that segment. As discussed later in this paper, the reasons why this assumption is flawed illuminate core challenges to AI generalization and provide a compass for improving patient safety with respect to generalization uncertainty.
AI Training Data and Generalization
The majority of FDA-approved AI-enabled medical devices are used for medical image analysis.17 Other FDA-cleared devices include clinical decision support (CDS) tools that analyze patient-specific data and output specific recommendations upon which clinicians may rely. There are also CDS algorithms that are part of electronic health records (EHR) and regulated by the Office of the National Coordinator (ONC)/Assistant Secretary for Technology Policy (ASTP). In the interest of focus and avoiding generalizations too broad to be actionable, this paper limits itself to AI-enabled medical devices involved in medical image analysis and defers discussion of CDS to a future publication.
Though there are a variety of different programming architectures, AI-enabled medical devices often use artificial neural networks (ANN). Convolutional neural networks (CNN), which are a subset of ANN, specialize in visual data analysis and are popular for processing medical images used as device inputs. A neural network’s processing of such inputs is ultimately expressed as a disease prediction, a categorization (such as a diagnosis), a recommendation, or some other clinical determination. Whatever the type, these determinations are generically described as device outputs.
The algorithms within a neural network’s processing tend to be probabilistic. In other words, they calculate the likelihood of a particular clinical determination rather than employing static rules that make the determination only when a set of conditions is satisfied. For example, the Sybil AI device evaluates a single low-dose CT scan (a type of specialized X-ray image) of the lungs and uses deep learning to18 predict a patient’s risk of developing lung cancer up to six years in the future.19 This probabilistic capacity can be life-saving because, in the case of various cancers, later detection results in a higher mortality rate.20
AI’s probabilistic nature is powerful in health care because it is well-suited for complex medical data that may be ambiguous, incomplete, or contain subtle patterns that are not readily detected by physician review. However, AI devices need considerable amounts of information21 to:
- improve device accuracy and maximize the range of possibilities and complexity that the device can successfully accommodate,22
- reduce the risk of model underfitting,
- increase training on infrequent—though medically important—edge case scenarios, and
- quantify output uncertainty.
The information used to achieve these objectives is known as training data. The type of training data varies by device category and includes (but is not limited to) x-rays, chest scans, and mammograms.23 The visual information from such images is entered into the neural network’s input layer (see Figure 1) as numeric representations of the image’s pixels. This data is progressively processed within the neurons of the hidden layers between the input and output layers. The final layer, the output layer, indicates the result produced through the calculations of the hidden layers. While the output may be presented to the user as text, inside the device the output is numeric, the importance of which will be illuminated in the explanation of device training.
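The path from pixels to a numeric output can be sketched in a few lines. The following is a minimal illustration only, not the architecture of any actual device: the 4 x 4 image, the layer sizes, the randomly initialized (untrained) parameters, and the two label strings are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 4x4 grayscale image with pixel intensities from 0 to 255.
image = rng.integers(0, 256, size=(4, 4))

# Input layer: the pixels become a flat list of 16 numbers scaled to 0..1.
inputs = image.astype(np.float64).reshape(-1) / 255.0

# Untrained, randomly initialized parameters (assumed sizes: 16 -> 8 -> 2).
w_hidden = rng.normal(0.0, 0.5, size=(16, 8))
b_hidden = np.zeros(8)
w_out = rng.normal(0.0, 0.5, size=(8, 2))
b_out = np.zeros(2)

def softmax(z):
    # Convert raw output values into probabilities that sum to 1.
    exponentials = np.exp(z - np.max(z))
    return exponentials / exponentials.sum()

# Hidden layer: weighted sums followed by a ReLU activation.
hidden = np.maximum(0.0, inputs @ w_hidden + b_hidden)

# Output layer: numeric probabilities, one per possible determination.
probabilities = softmax(hidden @ w_out + b_out)

# Only at the very end is the numeric output mapped to display text.
labels = ["no finding", "finding suspected"]
predicted_label = labels[int(np.argmax(probabilities))]
print(probabilities, predicted_label)
```

Because the parameters here are random, the printed label carries no meaning. The point is that everything before the final lookup is arithmetic on numbers, which is what allows training to measure and reduce the gap between expected and observed outputs.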

The most meaningful dissimilarity between traditional software development and ANN development is how parameters inside the software are set. A parameter is a variable (i.e., a value such as a number or a string of letters) used to control software functions and performance. For example, in a point-of-sale system that processes credit card payments for goods, there is a parameter that sets the sales tax rate for the end user’s state. In such traditional software, the parameters are manually entered into the code by a programmer. In neural-network medical devices,24 parameters are set by algorithms interacting with training data. The parameters in such devices may number in the hundreds of thousands (or more).25 The process of setting parameters is iterative and is described as “training” the AI model within the device. The word model refers to the totality of the system’s algorithms and their respective parameterizations.
For each instance of training data (e.g., a single mammogram), algorithms evaluate the difference between the correct clinical determination (i.e., the output) that was expected for that one instance of training data and the observed determination that the device produced. This difference is expressed numerically, and algorithms then adjust the parameters to minimize the difference between the expected and observed outputs. This process is repeated for every individual instance of training data, and the entire collection of training data may be processed multiple times26 as the device is being trained.
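The loop described above can be sketched with a deliberately tiny model. This is an illustration under stated assumptions, not a device's training procedure: the model has only two parameters (`weight` and `bias`), the training data is synthetic, and the learning rate and number of passes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training data: each instance has one input value and one
# known-correct (expected) output.
x = rng.uniform(-1.0, 1.0, size=50)
expected = 3.0 * x + 1.0 + rng.normal(0.0, 0.05, size=50)

weight, bias = 0.0, 0.0      # the model's two parameters, before training
learning_rate = 0.1          # how large each adjustment is (assumed value)
epoch_losses = []

for epoch in range(20):                      # multiple passes over the data
    for xi, yi in zip(x, expected):          # one training instance at a time
        observed = weight * xi + bias        # what the model produces now
        error = observed - yi                # numeric difference from expected
        # Nudge each parameter in the direction that shrinks the error.
        weight -= learning_rate * error * xi
        bias -= learning_rate * error
    # Track the average squared difference after each full pass.
    predictions = weight * x + bias
    epoch_losses.append(float(np.mean((predictions - expected) ** 2)))

print(f"first pass loss {epoch_losses[0]:.4f}, last pass loss {epoch_losses[-1]:.4f}")
print(f"learned weight {weight:.2f}, learned bias {bias:.2f}")
```

A real device repeats the same idea across a far larger number of parameters and far richer inputs, but the structure is the same: compare, measure the difference numerically, adjust, and repeat.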
Two key concepts in this process are backpropagation and gradient descent (those uninterested in this process discussion may skip ahead to Section II: Responses to Generalization Uncertainty and Their Limitations). Both concepts are rooted in the minimization of differences between the expected and observed outputs for the training data inputs. The differences between expected and observed outputs are measured numerically,27 and a major challenge in reducing these differences—thereby increasing the accuracy of outputs—is avoiding overfitting the model to the training data and, consequently, undermining the model’s ability to successfully generalize its functionality to new patient data encountered in the real world.
The backpropagation algorithm is a means to calculate the gradient (i.e., the direction and rate of change) of the parameters’ contribution to the observed output’s deviation28 from the expected output. Backpropagation is performed for each neural network layer (see Figure 1), moving in reverse from the output layer toward the input layer. Once complete, a gradient descent algorithm29 then uses the backpropagation’s calculation to adjust the model’s parameters in the correct direction (i.e., whether each parameter should increase or decrease) and scope (i.e., how large a numeric change) so that the difference between the expected and observed outputs is minimized. In other words, it moves in the opposite direction of the gradient and modifies the parameters so that the observed output will more closely resemble the expected output. The parameter updates resulting from these techniques do not guarantee that the device can generalize successfully in every circumstance. Instead, backpropagation and gradient descent will have optimized30 the model’s parameters for the training data, and the testing data’s validation of that parameterization is the basis for assuming that the device can generalize.
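For readers who want to see the two techniques side by side, the following sketch applies them to a very small network. It is a minimal illustration under assumptions of its own: the layer sizes, the tanh activation, the squared-error measure, and the learning rate are not taken from any device. It computes the gradient layer by layer in reverse, checks one gradient value against a direct numerical estimate, and then takes a single gradient descent step.

```python
import numpy as np

rng = np.random.default_rng(7)

x = rng.normal(size=3)   # one training input with three numeric features
expected = 1.0           # the expected output for that input

params = {
    "w1": rng.normal(0.0, 0.5, size=(4, 3)),  # input layer -> hidden layer
    "b1": np.zeros(4),
    "w2": rng.normal(0.0, 0.5, size=4),       # hidden layer -> output
    "b2": np.array(0.0),
}

def forward(p):
    hidden = np.tanh(p["w1"] @ x + p["b1"])
    observed = p["w2"] @ hidden + p["b2"]
    return hidden, observed

def loss(p):
    _, observed = forward(p)
    return float(0.5 * (observed - expected) ** 2)

def backpropagate(p):
    hidden, observed = forward(p)
    d_observed = observed - expected              # start at the output layer
    grads = {"w2": d_observed * hidden, "b2": d_observed}
    d_hidden = d_observed * p["w2"]               # move back to the hidden layer
    d_hidden_pre = d_hidden * (1.0 - hidden ** 2) # derivative of tanh
    grads["w1"] = np.outer(d_hidden_pre, x)
    grads["b1"] = d_hidden_pre
    return grads

grads = backpropagate(params)

# Sanity check: nudge one parameter slightly and measure the change in loss.
eps = 1e-6
bumped = {name: value.copy() for name, value in params.items()}
bumped["w1"][0, 0] += eps
numeric_gradient = (loss(bumped) - loss(params)) / eps

# Gradient descent: move every parameter a small step against its gradient.
learning_rate = 0.01
updated = {name: params[name] - learning_rate * grads[name] for name in params}
loss_before, loss_after = loss(params), loss(updated)
print(f"loss before {loss_before:.6f}, loss after one step {loss_after:.6f}")
```

The single step lowers the loss for this one training input only. Nothing in the calculation refers to data the model has not seen, which is why the result is optimization for the training data rather than a guarantee of generalization.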
Responses to Generalization Uncertainty and Their Limitations
Generalization uncertainty is a major concern in AI health care because a generalization problem can result in patient harm and related litigation. While these possibilities are hardly unique to AI medical devices,31 they can negatively affect the technology’s adoption. Slow adoption, for its part, can endanger the nation’s leadership in medical AI and the public health advantages this leadership can bring.
Lagging adoption impedes AI leadership because valuable knowledge is gained from the technology’s deployments, that is to say, implementations at health facilities. One reason is that AI may not be “plug and play” like traditional software. In such cases, deployments generate valuable insights when the effects of variables (such as clinical protocols) are evaluated for their influence on AI device performance.32 Moreover, aside from the real-world validation of AI functionality, deployments can produce useful data for future AI enhancements based on user interactions with AI devices.33 For example, one study notes that “choices related to user interface design can shape interactions between humans and AI models, introducing new cognitive biases into clinical decision-making.”34 Hence, reduced usage arising from generalization uncertainty would reduce downstream learning that could, in turn, slow the pace of AI upgrades and deployment guidance improvements.
A related concern is that health care AI will become the domain of affluent health systems, with much lower adoption by lower-resourced providers, such as rural health systems and urban hospitals. Further, when deployed in lower-resourced systems, limited in-house AI expertise may lead to safety events that might not be observed in the more resourced systems. Costly correctives for generalization uncertainty inflate the total cost of AI ownership and worsen the “digital divide” between richer health systems that can afford AI devices with mitigations for generalization uncertainty and less-resourced systems that cannot.
There are a variety of broad responses to generalization uncertainty. Among the options35 are:
- third-party device and algorithm certification,
- training data assessment, and
- physician evaluation of training data suitability.
While each of these approaches offers a partial solution, they share important limitations, most notably limited personalization and limited compatibility with future adaptive AI systems. These shortcomings point to the need for another alternative.
Third-Party Device Certification
One approach to generalization uncertainty is to have an objective and independent party certify that an AI-enabled medical device will likely generalize in the real world, including for specific patient subgroups.36 Certification would entail experimental verification of device functionality beyond the manufacturer’s own testing data and would be performed independently of a specific implementation at a health system.37 Additionally, verification criteria could be standardized for those devices performing the same function, which would make the meaning of certification clear for that device category.
This straightforward proposal must contend with several challenges. First, certification assumes that the independent party has access to the specific RWD needed for each category of AI-enabled medical device as well as the breadth of RWD needed to confirm generalization. The latter suggests the need for considerable RWD resources, because certification is being performed for the overall population with all its inherent diversity. Second, the approach lacks personalization—that is to say, the framing of a device’s reliability relative to an individual patient’s medical imaging or other medical information.
Third, the RWD must not only cover diverse use cases—like different patient illnesses and health states—but also image differences arising from different models and versions of the same imaging equipment. Different machines can produce different colors, contrast, and resolution for the same patient. These factors, as variables affecting device parameterization, could affect the AI devices’ image interpretation. Fourth, in addition to the need for domain-specific data, domain-specific clinical knowledge is required for labeling the correct RWD interpretation. Correct RWD interpretations must be produced before expected data interpretation can be compared with the device’s observed outputs. Certification is further burdened by the certifier’s obligation to keep current with the continual advancements across all AI device categories.38
Given the costs of data acquisition and expert labor, the price of certification may be too expensive for less-resourced health systems such as those located in rural areas. Additionally, certification assumes a particular historical snapshot, which would limit the certification’s value when the FDA begins approving adaptive AI medical devices someday in the future. Adaptive AI refers to AI models built to improve by adjusting parameters through continuous learning after deployment.39 The problem with a certification of a device during a single historical review extends beyond adaptive AI to locked algorithm devices that receive parameter updates as part of FDA-approved software updates.
Training Data Assessment
A more modest approach to the uncertainty problem is an assessment limited to a device’s training data.40 As with device certification, a training data assessment could be performed by an independent company or consultant. Characteristics of the device’s dataset, such as size and demographic representativeness, could function as measures of its training data quality and suggest the device’s potential to generalize.41 Representativeness, in particular, is
concerned with the extent to which the dataset represents the targeted population (such as patients) for which the application is intended. Whether the population of the dataset covers a sufficient range in terms of age, sex, race or other background information is the topic of the subdimension variety in demographics contained within the dimension variety. This dimension also contains the subdimension variety of data sources concerned with questions such as: Does the data originate from a single site? Were the measurements done with devices from the same or different manufacturers? Appropriately investigating such questions can provide a strong indication for the applicability and generalisability of the [machine learning] application in different environments.42
As an alternative to certification, training data assessment eliminates the need for RWD collection and the associated data labeling. This, by extension, reduces some of the cost to mitigate generalization uncertainty.
Although this approach’s logic is intuitive, it does not escape several of the difficulties already noted for certification. First and foremost, while desirable, a demographically representative dataset does not guarantee that the device will generalize to the dataset’s represented population segments. If, for example, the training data is demographically diverse but the visual characteristics within individual data samples are very similar to one another, then outlier patients (whose medical images significantly deviate from these characteristics) may be at higher risk for device non-generalization despite belonging to adequately represented demographic segments.
This predicament reflects the nature of how a neural network extracts features from an image.43 This process starts with the image’s pixels expressed as a matrix of numeric values. If, for example, the image was 500 x 500 pixels, then the matrix would have 250,000 numeric values, one for each pixel (500 rows multiplied by 500 columns). These numeric values represent each pixel’s color. Grayscale images have a single matrix of numbers for analysis, while color images have multiple related matrices for each pixel’s color components (e.g., red, green, and blue values). A separate and very small matrix, known as a kernel or filter, moves across an image matrix using small, orderly movements. The kernel has its own values within its matrix, and these values, when applied to the specific image patch, help extract simple features (see Figure 2). At the most basic level, these extracted features may concern edges, patterns, textures, certain color values, and so on.44 When the extracted features are analyzed in combination with one another, they can detect more complex structures and conditions within the image.
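The kernel operation can be made concrete with a small worked example. The sketch below is illustrative only: the 6 x 6 image and the kernel values (a standard vertical-edge detector often called a Sobel kernel) are assumptions chosen so the arithmetic is easy to follow, whereas in a trained network the kernel values are themselves learned parameters.

```python
import numpy as np

# A 500 x 500 grayscale image holds 500 * 500 = 250,000 numeric values.
pixel_count = np.zeros((500, 500)).size

# Hypothetical 6x6 grayscale image: dark left half (0), bright right half (255).
image = np.zeros((6, 6))
image[:, 3:] = 255.0

# A 3x3 kernel whose values respond to vertical edges (dark-to-bright, left to right).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])

def slide_kernel(image, kernel):
    """Move the kernel across the image one pixel at a time (no padding)."""
    k_rows, k_cols = kernel.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    feature_map = np.zeros((out_rows, out_cols))
    for r in range(out_rows):
        for c in range(out_cols):
            patch = image[r:r + k_rows, c:c + k_cols]
            # Multiply patch and kernel value by value, then add the products.
            feature_map[r, c] = np.sum(patch * kernel)
    return feature_map

feature_map = slide_kernel(image, kernel)
print(feature_map)
```

The resulting feature map is zero wherever the patch under the kernel is uniform and large where the patch straddles the boundary between dark and bright, so the kernel has extracted the location of the edge. The same patient anatomy rendered with different contrast or brightness would yield different feature-map values.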

As referenced in the prior certification discussion, medical image attributes extend beyond a patient’s anatomical characteristics. Differing imaging equipment, or generations of the same machine, can affect the resulting medical image and influence parameterization on the development side and generalization on the deployment side. There is also the matter of technician effects on medical images. For example, a sonographer performing an echocardiogram generates a digital record of heart behavior (e.g., wall movement, valve regurgitation, etc.). The sonographer’s choices regarding the plane of the ultrasound beam during the echocardiogram affect important measurements such as blood flow velocity and heart wall thickness.
Finally, not being an ongoing activity, a training data assessment is also ill-suited for future adaptive AI devices that change their parameters over time through continuous learning (though an analogous issue applies to nonadaptive systems where the device parameters are modified through a software update).
Physician Evaluation of Training Data Suitability
A third possible response to generalization uncertainty is physician evaluation of a device’s training data before the device is used with an individual patient. A physician, using a description provided by the manufacturer, would evaluate the degree to which a device’s training data encompasses a patient’s demographic attributes and then would decide if the device is appropriate for use with that patient.
Physician evaluation carries with it all the challenges of training data assessment while introducing an individual patient’s demographic characteristics in relation to the training data. This approach assumes a physician will consistently review a device’s training data description for each patient, which is open to debate given the frequency of physician neglect of drug labels.45 This approach also assumes the patient will undergo medical imaging at the physician’s health system where the physician would presumably have access to the device’s training data description (though the manufacturer may offer a training data description publicly in the form of an “applied model card”).46
A related concern is how much information a manufacturer will publicly disclose about a device’s training data.47 Training data, as the foundation of device performance, represents a valuable intellectual property asset. Consequently, a manufacturer may craft its data description around competitive considerations. Additionally, some manufacturers may withhold training data descriptions because critics and competitors might use them to shame the manufacturer publicly about insufficient demographic diversity.
A Better Response to Generalization Uncertainty
The shortcomings in generalization uncertainty responses provide valuable context for constructing an alternative. Ideally, an alternative would:
- personalize generalization uncertainty from the perspective of individual patients;
- reduce or eliminate the need for highly specialized and expensive data science consultants;
- avoid dependence on RWD, whether collected or purchased;
- accommodate differences among images arising from the use of different machines or imaging technicians;
- empower physicians to make more informed patient decisions about AI-enabled medical device use without requiring manual analysis;
- be compatible with either static or adaptive datasets (even though current FDA-approved medical devices have static datasets); and
- preserve training data confidentiality and its control by the device manufacturer.
This section proposes a generalization uncertainty response described as Digital Similarity Analysis (DSA). Unlike existing approaches, the DSA proposal does not rely on broad population-level validation or costly third-party certification. Instead, it provides a patient-specific assessment of whether an AI device is likely to generalize effectively. The final section of the paper reviews some implications of generalization uncertainty from a policy perspective.
Rethinking the Training Data Question
A central premise related to generalization uncertainty is that training data, as the basis for device parameterization, greatly influences generalization results. As explained earlier, the difference between the expected output for each training data instance and the observed output is minimized by iteratively modifying the device’s parameters, thus improving its accuracy. When training is fully completed, the device is validated on a smaller set of testing data.
To avoid underfitting (and compromising generalization), manufacturers employ an assortment of tactics. One of the most basic is training the device on a large and informationally diverse dataset. Large datasets, it is hoped, will provide sufficient pattern variations that, once reflected in the parameter settings, will enable the device to accommodate more RWD scenarios than would be the case with a narrow dataset. In the case of AI-enabled medical devices, the dataset is typically patient medical images whose correct interpretations (i.e., the expected output) are known prior to the training of the AI system.
However, even when training data is varied and demographically representative, generalization is not guaranteed because device performance depends on underlying image characteristics and their extraction. If a medical device’s training data is highly dissimilar to a patient’s own medical image, there are reasonable questions48 about whether this device will generalize for this individual. These doubts arise because the device’s parameters are adjusted by the visual characteristics of the training data,49 and different visual characteristics can result in different parameter modifications. Consequently, reviewing a written demographic summary of training data does not diminish generalization uncertainty, nor does a prior validation test that had not anticipated this patient’s deviation from the training dataset. In both cases, there remains an unconsidered—and potentially dangerous—mismatch between the characteristics of the patient image and those represented in the device’s training data.
Instead of accepting this state of affairs and its accompanying AI safety and adoption concerns, we can mitigate some degree of risk through “selective deployment” informed by the similarity between the patient’s medical image and the device’s training data.50
Selective Deployment
In their paper “The Selective Deployment of AI in Healthcare,”51 Robert Vandersluis and Julian Savulescu consider the underrepresentation of groups within training data. The authors use the phrase selective deployment to describe the use of AI with those populations for whom the AI performs well. Their first case study in support of this approach is a prognostic52 algorithm for breast cancer, and the second is a skin disease algorithm. In the case of the breast cancer algorithm, the training data was exclusively female, owing to data collection challenges that arise because women suffer from breast cancer much more frequently than men.53 In the case of the skin disease algorithm, there was a different disparity in training data representation. Light-skinned people were overrepresented compared to the population as a whole, as melanoma is much more prevalent among people with lighter skin.54
Vandersluis and Savulescu argue that instead of deploying the algorithm for general use with any patient, female or male, with breast cancer, it is better to use the algorithm with “patient subgroups where the model performs well, while withholding the algorithm from those patient subgroups for whom the model is expected to perform poorly (or unpredictably).”55 They state:
We believe that the ethical tensions arising from the delayed and expedited deployment options are sometimes best resolved through a selective deployment approach; this approach is not ideal, but instead—with appropriate regulation—represents the best way to balance harm prevention, utility and fairness considerations.56
Selective deployment for the skin disease algorithm, on the other hand, was not as clear for the authors, as this algorithm’s higher incidence of false positives among darker-skinned people would lead to higher rates of specialist care for that population.
The authors’ advocacy of selective deployment does not preclude bias mitigation efforts, such as increasing population diversity in data collection and clinical research.57 Rather, the approach seeks to balance lowering patient harm risk in some groups with the promise of medical benefits for others. Selective deployment seeks to reduce both direct and indirect harm by approving AI tool access for populations where indications suggest that use is appropriate, while discouraging use where it is not.
However, if a selective deployment decision is limited to the training data’s demographic composition, in some circumstances AI device benefits may be needlessly withheld from groups despite the intention to prevent patient harm. The underrepresentation of a patient’s demographics in training data, though undesirable, does not confirm that an AI device will fail to generalize for that patient—just as adequate representation of a patient’s demographics does not guarantee successful generalization. A preferable framework would personalize a selective deployment decision based on characteristics related to an individual patient. This premise is the foundation of the Digital Similarity Analysis proposal.
Digital Similarity Analysis
DSA proposes to evaluate the resemblance (similarity/dissimilarity) between an individual patient’s medical image58 and the domain of medical images comprising a device’s training data and testing59 data. The purpose of DSA is to determine if a patient’s medical image is an outlier compared to this data prior to the use of that device with the patient image. The physician, when alerted by DSA that a patient’s image is significantly dissimilar to the AI device’s training and testing data, can make a selective deployment decision to (a) forgo device use, (b) require supplemental validation of the medical device’s output for the patient, or (c) use the device but treat any clinical determination produced by the device with lower confidence. It is important to note that an outlier judgment by the DSA approach does not predict a generalization failure; rather, it highlights when an image’s attributes were not well represented in either the AI model’s parameterization (training) or validation (testing).
The DSA hypothesis begins with the comparison of an individual patient’s medical image to the training and testing data used to develop an AI medical device. This comparison does not seek identical matches, because an AI medical device is not an index of images. CNN techniques,60 such as feature extraction and pooling,61 produce an abstraction of a patient medical image that is then processed through the remaining layers of the neural network. Accordingly, the comparison determines whether the patient image belongs to the same domain (i.e., family of characteristics) as the images used for device training and testing.
Determining visual similarity, while intuitive for humans, is a sophisticated operation for computers. A host of variables complicate the process of distinguishing an image feature from its larger context. Figures 3 and 4, while simplistic, illuminate some basic contours of the challenge. Starting with Figure 3, two identical representations of the letter A are displayed side-by-side using the same color and pixel pattern. However, in the second, the letter A is shadowed and requires a computer to distinguish the letter shape from both the background’s lighter contrast pixels and the darker contrast black pixels of the shadow.
Figure 4 illustrates how, even in the absence of contrast complexities (for letters sharing the same identity, color, and typographical pattern), the additional factors of size and rotation present interpretation issues that a computer must navigate. Pattern attributes such as line width, edge orientation, and the distance between the character’s crossbar and apex differ from one letter A to the next.
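The rotation problem can be made concrete with a small sketch. It does not reproduce Figures 3 or 4; the 5x5 grid standing in for the letter A and the helper names `rotate_clockwise`, `pixel_agreement`, and `ink_count` are assumptions for illustration.

```python
# Illustrative 5x5 binary "letter A": 1 = ink, 0 = background.
LETTER_A = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
]


def rotate_clockwise(grid):
    """Rotate a square grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def pixel_agreement(a, b):
    """Fraction of positions at which two same-sized grids hold the same value."""
    total = len(a) * len(a[0])
    same = sum(
        pa == pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )
    return same / total


def ink_count(grid):
    """Number of inked pixels, a crude property that rotation does not change."""
    return sum(sum(row) for row in grid)


rotated = rotate_clockwise(LETTER_A)
agreement_with_self = pixel_agreement(LETTER_A, LETTER_A)
agreement_with_rotated = pixel_agreement(LETTER_A, rotated)
```

A naive position-by-position comparison scores the rotated letter as a poorer match even though it is the same character with the same amount of ink. That gap is why similarity analysis tends to rely on features designed to tolerate changes in size and rotation.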

Fortunately, computer science has produced a wealth of image analysis algorithms that can be employed for DSA alongside novel algorithms for unaddressed tasks. Some existing image analysis algorithms relevant to this activity are:
- Structural Similarity Index Measure (SSIM): Originally developed for the quality assessment of an image relative to its source (such as after lossy image compression), SSIM can assess similarity by comparing attributes such as contrast and luminance (i.e., brightness).
- Cosine Similarity (CS): CS calculates the similarity between two vectors (e.g., linear patterns) detected within an image and may be used within the larger process of analyzing the structures within an image regardless of scale and rotation.
- Scale-Invariant Feature Transform (SIFT): SIFT identifies features in a source image and evaluates comparison images for the presence of these features regardless of size and rotation.
- Oriented FAST and Rotated BRIEF (ORB): ORB, like SIFT, identifies image features but ORB is optimized for computational speed.
- Fréchet Radiomic Distance (FRD): FRD is an algorithm created for radiology images and performs a variety of tasks including detecting the main features that differ between image sets and identifying out-of-domain characteristics.62
- Speeded-Up Robust Features (SURF): SURF63 employs a fast computation process of feature extraction, description, and matching.64
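Two of the measures listed above are simple enough to sketch in a few lines. The following is a simplified illustration rather than a production implementation: `global_ssim` computes the SSIM statistic once over the whole image, whereas the published measure averages it over small local windows, and the function names and sample pixel values are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def global_ssim(img_a, img_b, dynamic_range=255.0):
    """SSIM statistic over two whole grayscale images given as flat pixel lists.

    Simplification: a single global window is used instead of the local
    windows of the published measure.
    """
    n = len(img_a)
    mean_a = sum(img_a) / n
    mean_b = sum(img_b) / n
    var_a = sum((p - mean_a) ** 2 for p in img_a) / n
    var_b = sum((q - mean_b) ** 2 for q in img_b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(img_a, img_b)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizing constants with the
    c2 = (0.03 * dynamic_range) ** 2  # commonly used default factors
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


# Hypothetical six-pixel "images" for demonstration.
reference = [10, 50, 90, 130, 170, 210]
brighter = [p + 40 for p in reference]    # same structure, higher luminance
inverted = [255 - p for p in reference]   # structure reversed

ssim_same = global_ssim(reference, reference)
ssim_brighter = global_ssim(reference, brighter)
ssim_inverted = global_ssim(reference, inverted)
```

In this sketch, an image compared with itself scores 1, a uniformly brighter copy scores somewhat lower, and an inverted copy scores below zero. Cosine similarity ignores vector magnitude, so `cosine_similarity([1, 2, 3], [2, 4, 6])` is approximately 1.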
Discriminators within generative adversarial networks (GANs) may also prove helpful for DSA development. A GAN uses a software “discriminator” to determine whether an AI-generated image is sufficiently similar to confirmed examples within the image class being mimicked (i.e., generated by the GAN for an end user). Beyond GANs, there are other pretrained networks commercially marketed for image analysis.65
While not solving every challenge of image similarity analysis in AI-enabled medical devices, existing resources provide a strong foundation from which the DSA hypothesis can be explored. Moreover, preexisting algorithms for image analysis have been validated in tasks such as facial recognition, categorizing objects within a photo, retrieving similar images from repositories, recognizing duplicate images, and detecting image degradation after file compression. The algorithm list itself does not exhaust the many options that can assist in determining similarity between an individual patient’s medical image and a device’s training and testing images. DSA’s similarity judgment should be a composite score involving multiple dimensions:
- The patient image’s similarity to the total collection of training data
- The percentage of files within the training data that have a high similarity to the patient image
- The patient image’s similarity to the total collection of testing data used for device validation
- The highest similarity score between the patient image and any one instance of testing data
Testing-data similarity should be given more weight within similarity scoring than training-data similarity, because the testing data has been used to validate the performance of the AI medical device. Training data, while used to adjust the model’s parameters, has some opaque dependencies. When backpropagation is performed to minimize the difference between a training example’s expected and observed outputs, the magnitude of the parameter adjustments is constrained by the learning rate as well as the number of training epochs.66 Neither variable is known outside the manufacturer that developed the device.
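One way the four dimensions and the heavier testing-data weighting might be combined is sketched below. The paper does not specify a formula, so everything here is an assumption: the weighted sum, the weight values, the 0.8 cutoff for "high similarity," the reading of "similarity to the total collection" as a mean, and the premise that per-image similarity scores between 0 and 1 have already been computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Hypothetical weights; the testing-data terms carry more weight."""
    training_mean: float = 0.15
    training_share: float = 0.15
    testing_mean: float = 0.35
    testing_max: float = 0.35


def composite_similarity(training_scores, testing_scores,
                         high_threshold=0.8, weights=Weights()):
    """Combine per-image similarity scores (each in [0, 1]) into one score.

    training_scores: patient image vs. each training image
    testing_scores:  patient image vs. each testing image
    """
    if not training_scores or not testing_scores:
        raise ValueError("both score lists must be non-empty")

    # Dimension 1: similarity to the training collection as a whole (mean).
    training_mean = sum(training_scores) / len(training_scores)
    # Dimension 2: share of training images that are highly similar.
    training_share = (
        sum(1 for s in training_scores if s >= high_threshold)
        / len(training_scores)
    )
    # Dimension 3: similarity to the testing collection as a whole (mean).
    testing_mean = sum(testing_scores) / len(testing_scores)
    # Dimension 4: best match with any single testing image.
    testing_max = max(testing_scores)

    return (
        weights.training_mean * training_mean
        + weights.training_share * training_share
        + weights.testing_mean * testing_mean
        + weights.testing_max * testing_max
    )


# Hypothetical scores for a single patient image.
score = composite_similarity([0.9, 0.7, 0.5, 0.3], [0.6, 0.8])
```

Because the default weights sum to 1 and each dimension lies between 0 and 1, the composite stays within the same range. Actual weights would need to be set through empirical testing.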
DSA does not attempt to replicate each medical device’s custom set of kernels in order to make its feature extraction identical to that of the medical device. Such replication would not only require a device manufacturer to divulge highly confidential trade secrets, but it would also make the DSA process costly and highly customized. Instead, DSA recognizes that a device’s selection and representation of features is derivative of more basic visual attributes present in the raw image input prior to feature extraction. DSA would operate at an intermediate level of image comparison that, while less detailed than the highly specialized kernels of the medical device, would still discriminate significant features upon which meaningful similarities and dissimilarities can be established. The operationalization of this comparison, however, requires empirical testing to establish dissimilarity thresholds that justify selective deployment suggestions.
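How such a threshold might be derived empirically can be sketched as follows. This is one possible approach, not a method specified by the proposal: the percentile rule, the 5 percent flag fraction, the sample scores, and the names `calibrate_threshold` and `is_outlier` are all assumptions for illustration.

```python
def calibrate_threshold(in_domain_scores, flag_fraction=0.05):
    """Pick the similarity score below which roughly `flag_fraction`
    of images already known to be in-domain would fall."""
    if not in_domain_scores:
        raise ValueError("at least one reference score is required")
    ordered = sorted(in_domain_scores)
    index = min(int(flag_fraction * len(ordered)), len(ordered) - 1)
    return ordered[index]


def is_outlier(patient_score, threshold):
    """Flag a patient image whose similarity score falls below the threshold.

    A flag is an alert for the clinician, not a prediction of device failure.
    """
    return patient_score < threshold


# Hypothetical similarity scores for 100 images known to be in-domain.
reference_scores = [i / 100 for i in range(1, 101)]
threshold = calibrate_threshold(reference_scores)

flag_low = is_outlier(0.03, threshold)    # very dissimilar patient image
flag_typical = is_outlier(0.50, threshold)
```

A stricter or looser `flag_fraction` trades missed outliers against unnecessary alerts. Choosing that balance is the kind of question the empirical testing would need to answer.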
DSA does not require training and testing data to be either publicly disclosed or publicly rated. In theory, the DSA process could be implemented either locally at the device manufacturer67 or remotely within a private and HIPAA-compliant cloud environment provided by a DSA vendor. In either scenario, DSA would be made accessible to medical facilities that own DSA-supported devices. These facilities would securely upload patient images via the internet for comparative evaluation. The manufacturers themselves would have an incentive to provide DSA to providers, because DSA:
- supplies patient-level evaluation that lowers a medical device’s perceived generalization risks in the eyes of health systems considering its purchase,
- avoids certification or assessment consulting expenses that could significantly increase a device’s total cost of ownership,
- increases transparency around training images without compromising their confidentiality, and
- demonstrates a good faith effort to mitigate patient safety issues associated with generalization uncertainty, not only for groups underrepresented in the device’s training data but for patients overall.
An important dimension of the DSA proposal is the capacity for the tool to be syndicated, that is to say, offered as standardized software as opposed to a custom project for each facility that desires it. Syndication could reduce implementation costs for DSA as compared to a highly custom/consultative offering. Syndication allows development expenses to be spread out across clients and lowers the marginal cost of adding new clients. Thus, syndication would ideally make the tool more financially accessible for rural health systems and under-resourced urban facilities.
Comparison of Competing Responses to Generalization Uncertainty
When contrasted against the previously discussed responses to generalization uncertainty, DSA has multiple advantages as demonstrated in Table 1.

A last matter not addressed in Table 1 is the challenge of image consistency and quality from one imaging device to another. The value of a device certification and training data assessment is undermined by RWD images that differ in color, contrast, brightness, size, resolution, and other qualities that are not related to the patient. Such differences could materially interfere with a medical device’s generalization, as these equipment-related differences might impair a neural network’s perception of the patterns it was designed to detect.68 DSA may prove especially helpful in this context, as these equipment-related differences would be evident within the image comparison process.
Policy Implications
The DSA proposal could make a valuable contribution to AI medical device safety and avoid alternatives that inadequately address the problem. Further, DSA extends the discussion of algorithmic bias from broad demographic categories to individual patient characteristics. The proposal also has the additional benefit of addressing image variation arising from differences in radiology equipment and technician techniques.
By improving the grounds for AI medical device adoption, DSA aligns well with the nation’s larger AI health policy aspirations, such as “reduce[d] barriers to the use of AI technologies to promote their innovative application.”69 Moreover, DSA use does not preclude other complementary safety efforts. Notable in this context is targeted postmarket surveillance for AI-enabled medical devices at risk for future unpredictability,70 as well as efforts to predict generalization.
From a policy perspective, DSA does not need major legislative or regulatory interventions or government grants. Freedom from these entanglements insulates the DSA proposal from the debates that new rules attract as well as accusations of government favoritism. Additionally, as a tool for evaluating medical device suitability—and not delivering the medical functionality itself—DSA does not require FDA approval or a regulatory sandbox.71 Leveraging preexisting image analysis algorithms could reduce both the tool’s development cost and time to market.
With respect to the health care market, DSA use would be voluntary, and its success is not dependent on any mandates around training data transparency,72 sourcing,73 or composition requirements.74 This distinguishes DSA from alternatives that rely on mandates. The State of California, for example, passed the Generative Artificial Intelligence: Training Data Transparency Act.75 This law requires developers of generative AI systems to publicly disclose a variety of characteristics of the data that was used to train these systems. Washington State has proposed a similar bill.76
While free of government dependencies, the validation of DSA could be assisted (post-development) by a pilot project within the Veterans Health Administration (VHA). As noted in the paper “Could the VA Be the Key to Lowering the Cost of American Health Care?”,77 the VHA provides numerous assets with regard to AI. The VHA has the scale and geographic breadth to provide a source of RWD that represents diverse populations and health trends that could robustly test DSA’s effectiveness. With regard to this breadth, the agency serves over 9.1 million veterans and operates 1,380 health care facilities across the United States.78 Additionally, the VHA, being part of the Department of Veterans Affairs, comes under the agency’s “Fourth Mission,” a directive that includes the support of public health efforts.79
Lastly, should the DSA proposal prove successful, the learnings could be extended to analogous AI challenges related to the interpretation of non-image inputs (e.g., EKGs), serial images in video, or devices analyzing sound patterns (e.g., AI-enabled stethoscopes).
Appendix: List of Selected AI Acronyms
ANN: Artificial Neural Network. An ANN is a form of machine learning that uses layers of artificial neurons (called nodes) to process data entered into the ANN. Each node within a layer is an individual software module that, as part of its collaboration on a shared ANN task, processes the data passed into it and then provides the result to nodes in the next layer of the ANN.
CNN: Convolutional Neural Network. A CNN is a type of ANN that is optimized for the analysis of visual data within an image. A CNN extracts various features within a medical image in order to perform a disease prediction, diagnosis, recommendation, or other clinical determination.
DSA: Digital Similarity Analysis. An approach to AI-enabled medical device safety that evaluates the similarity/dissimilarity of a patient medical image to the data used in a specific medical device’s parameterization (training) and validation (testing) processes. If the patient image is significantly dissimilar to the device data, the clinician is alerted that the patient image represents an outlier condition and the clinician should consider (a) forgoing device use, (b) requiring supplemental validation of the medical device’s output for the patient, or (c) using the device but treating any clinical determination produced by the device with lower confidence.
GAN: Generative Adversarial Network. A GAN is a form of AI where original content (text, speech, images, etc.) is produced through the interactions of two ANNs: a generator and a discriminator. The generator generates novel content, and the discriminator evaluates the content. The discriminator is adversarial because it compares the new content to examples of content in the same category that are already confirmed to be authentic. If the discriminator can distinguish the new content from the confirmed content, the generator must improve the content through subsequent iterations until the discriminator cannot reliably differentiate between the new content and the examples.
RWD: Real-World Data. Patient data sourced from the medical facilities at which AI-enabled medical devices are deployed. RWD is separate from an AI device’s training and testing data.
VHA: Veterans Health Administration. The VHA delivers health care services to American veterans.
