“Our study was designed to avoid some common pitfalls in previous studies of both machine learning and deep learning. Although a case-control design is helpful to show the added value of machine learning or deep learning, the study sample from such a design would not be representative of the general screening population and, therefore, its predictor could not be directly applied to clinical practice. Most studies of machine learning and deep learning did not mask the validation sample’s cancer outcomes. This drawback could result in multiple attempts of fitting spurious associations that could lead to an artificially crafted highly accurate prediction. Many published studies of machine learning and deep learning did not differentiate cancers diagnosed at different timepoints in both prediction algorithm development and its AUC assessment. Our study outcomes were derived from survival analysis that takes an individual’s length of follow-up into consideration. This approach not only reduces bias from different durations of follow-up for participants but also associates higher DeepLR [the authors’ deep learning prediction algorithm] scores with earlier lung cancer diagnosis. Early censored individuals were not treated as non-cancers in later years because they could also develop cancers later. This difference between our approach and most existing methods is important.

By contrast with currently available malignancy risk prediction methods or guidelines that are nodule-based, which quantify an individual’s malignancy risk using the largest dominant nodule, DeepLR takes into account aggregate changes in nodule characteristics and non-nodule features from the same individual. Examining potential interactions between different nodules is important, because more than a third of individuals in the training cohort and half of individuals in the validation cohort had at least two non-calcified nodules of 4 mm or larger on their S2 scans. Basing guidelines on the largest nodule can be problematic; in PanCan, 20% of malignant disease arose from non-dominant nodules. Although Ardila and colleagues’ work has incorporated image features from multiple nodules, prediction was restricted to 1-year cancer risk and was not assessed from the survival analysis. An additional strength of our study is the ability of DeepLR to recognise patterns in both temporal and spatial changes, including synergy among changes from different nodules. DeepLR can be used to provide guidance to clinicians to personalise the repeat screening interval and to ascertain the urgency for diagnostic investigations to rule out lung cancer in a manner that is not currently available in existing clinical practice guidelines.”

Prediction of lung cancer risk at follow-up screening with low-dose CT: a training and validation study of a deep learning method (2019.10.17)
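
The point about censoring in the excerpt above can be made concrete with a small sketch. The excerpt only says the outcomes came from "survival analysis"; the Kaplan-Meier estimator used here is a standard stand-in chosen for illustration, not necessarily what the authors used. The data, the function names `kaplan_meier` and `naive_risk`, and the time unit of years are all hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with a diagnosis.

    durations: follow-up time per person (hypothetical unit: years)
    events:    True if cancer was diagnosed at that time, False if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        diagnosed = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if diagnosed:
            # Only people still under observation at time t count as at risk.
            survival *= 1 - diagnosed / at_risk
            curve.append((t, survival))
        # Diagnosed and censored people both leave the risk set after t.
        at_risk -= leaving
    return curve


def naive_risk(durations, events, horizon):
    """Share diagnosed by `horizon`, counting early-censored people as non-cancers."""
    diagnosed = sum(1 for d, e in zip(durations, events) if e and d <= horizon)
    return diagnosed / len(durations)


# Toy cohort (made up): three diagnoses, two people censored before year 3.
durations = [1, 1, 2, 2, 3, 3, 3, 3]
events = [True, False, True, False, True, False, False, False]

curve = kaplan_meier(durations, events)
km_risk = 1 - curve[-1][1]
naive = naive_risk(durations, events, horizon=3)
print(f"Kaplan-Meier 3-year risk: {km_risk:.3f}")
print(f"Naive 3-year risk:        {naive:.3f}")
```

On this toy cohort the naive proportion (0.375) is lower than the Kaplan-Meier estimate (about 0.453), because the naive count silently treats the two early-censored people as cancer-free through year 3.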

“Proponents of risk assessment tools argue that they make the criminal legal system more fair. They replace judges’ intuition and bias—in particular, racial bias—with a seemingly more “objective” evaluation. They also can replace the practice of posting bail in the US, which requires defendants to pay a sum of money for their release. Bail discriminates against poor Americans and disproportionately affects black defendants, who are overrepresented in the criminal legal system.

As required by law, COMPAS doesn’t include race in calculating its risk scores. In 2016, however, a ProPublica investigation argued that the tool was still biased against blacks. ProPublica found that among defendants who were never rearrested, black defendants were twice as likely as white ones to have been labeled high-risk by COMPAS.

[..] we always jail some defendants who don’t get rearrested (empty dots to the right of the threshold) and release some defendants who do get rearrested (filled dots to the left of threshold). This is a trade-off that our criminal legal system has always dealt with, and it’s no different when we use an algorithm.

[..] You can already see two problems with using an algorithm like COMPAS. The first is that better prediction can always help reduce error rates across the board, but it can never eliminate them entirely. No matter how much data we collect, two people who look the same to the algorithm can always end up making different choices.

The second problem is that even if you follow COMPAS’s recommendations consistently, someone—a human—has to first decide where the “high risk” threshold should lie, whether by using Blackstone’s ratio or something else. That depends on all kinds of considerations—political, economic, and social.

Now we’ll come to a third problem. This is where our explorations of fairness start to get interesting. How do the error rates compare across different groups? Are there certain types of people who are more likely to get needlessly detained?

[..] We gave you two definitions of fairness: keep the error rates comparable between groups, and treat people with the same risk scores in the same way. Both of these definitions are totally defensible! But satisfying both at the same time is impossible.

[..] This strange conflict of fairness definitions isn’t just limited to risk assessment algorithms in the criminal legal system. The same sorts of paradoxes hold true for credit scoring, insurance, and hiring algorithms. In any context where an automated decision-making system must allocate resources or punishments among multiple groups that have different outcomes, different definitions of fairness will inevitably turn out to be mutually exclusive.

There is no algorithm that can fix this; this isn’t even an algorithmic problem, really. Human judges are currently making the same sorts of forced trade-offs—and have done so throughout history.

But here’s what an algorithm has changed. Though judges may not always be transparent about how they choose between different notions of fairness, people can contest their decisions. In contrast, COMPAS, which is made by the private company Northpointe, is a trade secret that cannot be publicly reviewed or interrogated. Defendants can no longer question its outcomes, and government agencies lose the ability to scrutinize the decision-making process. There is no more public accountability. [..]

[University of California law professor Andrew] Selbst recommends proceeding with caution: “Whenever you turn philosophical notions of fairness into mathematical expressions, they lose their nuance, their flexibility, their malleability,” he says. “That’s not to say that some of the efficiencies of doing so won’t eventually be worthwhile. I just have my doubts.”

Can you make AI fairer than a judge? Play our courtroom algorithm game – MIT Technology Review (2019.10.17)
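
The conflict between the two fairness definitions in the excerpt above can be sketched with a few lines of arithmetic. Everything here is hypothetical: the two groups, their head counts, the two score levels, and the function name `error_rates`. None of it is COMPAS data. The sketch assumes scores are perfectly calibrated in both groups (a fraction equal to the score is rearrested) and that one threshold is applied to everyone, so people with the same score are treated the same way.

```python
def error_rates(score_counts, threshold):
    """Expected (false positive rate, false negative rate) for one group.

    score_counts maps a calibrated risk score to the number of people who
    received it; calibrated means a fraction `score` of them is rearrested.
    """
    fp = fn = rearrested = not_rearrested = 0.0
    for score, n in score_counts.items():
        pos = n * score        # expected number rearrested
        neg = n * (1 - score)  # expected number not rearrested
        rearrested += pos
        not_rearrested += neg
        if score >= threshold:
            fp += neg          # labeled high-risk, never rearrested
        else:
            fn += pos          # labeled low-risk, rearrested
    return fp / not_rearrested, fn / rearrested


# Hypothetical groups of 100 people each, differing only in how many
# people fall at each score level (that is, in their base rates).
group_a = {0.2: 80, 0.8: 20}
group_b = {0.2: 40, 0.8: 60}

fpr_a, fnr_a = error_rates(group_a, threshold=0.5)
fpr_b, fnr_b = error_rates(group_b, threshold=0.5)
print(f"Group A: FPR={fpr_a:.3f}, FNR={fnr_a:.3f}")
print(f"Group B: FPR={fpr_b:.3f}, FNR={fnr_b:.3f}")
```

With the same scores meaning the same thing in both groups and the same threshold for everyone, the expected false positive rate is about 0.06 in group A and about 0.27 in group B. Equalizing those error rates would require group-specific thresholds, which gives up treating equal scores equally.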