Reading medical data

The diabetic retinopathy challenge, despite its difficulties, is not as complicated as medical imaging gets. Its images were provided as JPEGs, but most medical data is not; it usually arrives in container formats such as DICOM. DICOM stands for Digital Imaging and Communications in Medicine and comes in a number of versions and variations. A DICOM file contains the medical image itself along with header data. The header usually includes general demographic and study information, but it can also hold dozens of custom fields. If you are lucky, it will include a diagnosis, which you can use as a label.

DICOM data adds another step to the pipeline we discussed earlier because we now need to read the DICOM file, extract the header (and hopefully class/label data), and extract the underlying image. DICOM is not as easy to work with as JPEG or PNG, but it is not too difficult. It will require some extra packages.

Since we're writing almost everything in Python, let's use a Python library for DICOM processing. The most popular is pydicom, which is available at https://github.com/darcymason/pydicom.

The documentation is available at https://pydicom.readthedocs.io/en/stable/getting_started.html.

Note that the pip installation is currently broken, so the library must be cloned from the source repository and installed via its setup script before it can be used.

A quick excerpt from the documentation will help set the stage for understanding how to work with DICOM files:

>>> import dicom 
>>> plan = dicom.read_file("rtplan.dcm") 
>>> plan.PatientName 
'Last^First^mid^pre' 
>>> plan.dir("setup")    # get a list of tags with "setup" somewhere in the name 
['PatientSetupSequence'] 
>>> plan.PatientSetupSequence[0] 
(0018, 5100) Patient Position                    CS: 'HFS' 
(300a, 0182) Patient Setup Number                IS: '1' 
(300a, 01b2) Setup Technique Description         ST: '' 

This may seem a bit messy, but this is the type of interaction you should expect when working with medical data. Worse, each vendor often places the same data, even basic data, into slightly different tags. The typical industry practice is to simply look around! We do that by dumping the entire tag set as follows:

>>> plan 
(0008, 0012) Instance Creation Date              DA: '20030903' 
(0008, 0013) Instance Creation Time              TM: '150031' 
(0008, 0016) SOP Class UID                       UI: RT Plan Storage 
(0008, 0018) Diagnosis                           UI: Positive 
(0008, 0020) Study Date                          DA: '20030716' 
(0008, 0030) Study Time                          TM: '153557' 
(0008, 0050) Accession Number                    SH: '' 
(0008, 0060) Modality                            CS: 'RTPLAN'

Suppose we were seeking the diagnosis. We would look through the tags of several files to see whether the diagnosis consistently shows up under tag (0008, 0018) Diagnosis. If so, we'd test our hypothesis by pulling just this field from a large portion of our training set to see whether it is indeed consistently populated. If it is, we're ready for the next step; if not, we need to start over and look at other fields. In theory, the data provider, broker, or vendor can supply this information, but, in practice, it is rarely that simple.
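One quick way to test such a hypothesis is a small counting script. The following is a minimal sketch, assuming the same older dicom module used throughout this section, a placeholder data directory, and (0008, 0018) as the candidate tag:

import dicom, glob, os 

os.chdir("/some/medical/data/dir")            # placeholder directory 
populated, missing = 0, 0 
for name in glob.glob("*.dcm"): 
    ds = dicom.read_file(name) 
    try: 
        # (0008, 0018) is our hypothesized Diagnosis tag 
        value = ds[0x0008, 0x0018].value 
        if str(value).strip(): 
            populated += 1 
        else: 
            missing += 1 
    except KeyError: 
        missing += 1 

print("populated: %d, missing or empty: %d" % (populated, missing))

If the populated count covers nearly all of the files, the hypothesis holds; otherwise, it's back to inspecting tags.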

The next step is to see the domain of values. This is very important because we want to see what our classes look like. Ideally, we will have a nice clean set of values such as {Negative, Positive}, but, in reality, we often get a long tail of dirty values. So, the typical approach is to loop through every single image and keep a count of each unique domain value encountered, as follows:

>>> import dicom, glob, os 
>>> os.chdir("/some/medical/data/dir") 
>>> domains = {} 
>>> for name in glob.glob("*.dcm"): 
...     aMedFile = dicom.read_file(name) 
...     theVal = aMedFile[0x0008, 0x0018].value   # the hypothesized Diagnosis tag 
...     if theVal in domains: 
...         domains[theVal] = domains[theVal] + 1 
...     else: 
...         domains[theVal] = 1 

A very common finding at this point is that 99 percent of the images fall into a handful of clean domain values (such as Positive and Negative), while the remaining 1 percent form a long tail of dirty values (such as positive, but under review, @#Q#$%@#$%, or sent for re-read). The easiest thing to do is to throw out the long tail and keep only the good data. This is especially reasonable if there is plenty of training data.
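One simple way to do that pruning is to filter the domains dictionary built above against a whitelist of clean labels; the whitelist here is a hypothetical example, not something fixed by the data:

# Keep only the clean, high-frequency class values; drop the dirty long tail. 
# The whitelist below is hypothetical; use whatever the domain dump shows. 
clean_labels = {"Positive", "Negative"} 

kept = {val: n for val, n in domains.items() if val in clean_labels} 
dropped = sum(n for val, n in domains.items() if val not in clean_labels) 
print("keeping %d images, dropping %d dirty-label images" 
      % (sum(kept.values()), dropped))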

OK, so we've extracted the class information, but we've still got to extract the actual image. We can do that as follows:

>>> import dicom 
>>> ds=dicom.read_file("MR_small.dcm") 
>>> ds.pixel_array 
array([[ 905, 1019, 1227, ...,  302,  304,  328], 
       [ 628,  770,  907, ...,  298,  331,  355], 
       [ 498,  566,  706, ...,  280,  285,  320], 
       ..., 
       [ 334,  400,  431, ..., 1094, 1068, 1083], 
       [ 339,  377,  413, ..., 1318, 1346, 1336], 
       [ 378,  374,  422, ..., 1369, 1129,  862]], dtype=int16) 
>>> ds.pixel_array.shape 
(64, 64)

Unfortunately, this only gives us a raw matrix of pixel values. We still need to convert it into a readable image format (ideally, JPEG or PNG).

Next, we'll scale the image to the bit depth we want and write the matrix to a file using another library geared toward writing our destination format. In our case, we'll use PNG output and write it with the png library. This means some extra imports:

import os 
from pydicom import dicomio 
import png 
import errno 
import fnmatch

We'll export like this:

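Here is a minimal sketch of that step, assuming plain min/max scaling down to 8-bit grayscale; the dicom_to_png helper name and the output filename are placeholders rather than anything fixed by the surrounding text:

import png 
from pydicom import dicomio 


def dicom_to_png(dcm_path, png_path): 
    """Read a DICOM file, scale its pixels to 8 bits, and write a PNG. 

    Assumes a single-frame grayscale image and plain min/max scaling; 
    real studies may need proper windowing instead. 
    """ 
    ds = dicomio.read_file(dcm_path) 
    pixels = ds.pixel_array                      # raw matrix, e.g. int16 
    lo, hi = pixels.min(), pixels.max() 
    scaled = (pixels - lo) * 255.0 / max(hi - lo, 1) 
    scaled = scaled.astype('uint8') 

    with open(png_path, 'wb') as f: 
        writer = png.Writer(width=scaled.shape[1], height=scaled.shape[0], 
                            greyscale=True, bitdepth=8) 
        writer.write(f, scaled.tolist()) 


dicom_to_png("MR_small.dcm", "MR_small.png")     # file from the earlier example

With the class labels pulled from the headers and the images written out as PNGs, the data is back in the familiar image-plus-label form that the rest of the pipeline expects.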