Problem-specific techniques

As the term suggests, problem-specific techniques are applied when we take into account the particular characteristics of our current dataset and proceed to process it accordingly. In general, there is no way to tell what particularities your dataset contains; the only approach is to, again, extensively explore the dataset and address any problems as they arise.

In the following steps, we will go over some more pre-processing techniques that are specific to our datasets, giving you more experience dealing with badly formatted data. While the gist of each section of our code will be discussed, if you have trouble understanding the effect of any specific command, you can look it up in the Pandas documentation, linked in the Further reading section of this chapter:

  1. First, let's read a sample file from the Tappy Data folder:
#%% Explore the second dataset

import pandas as pd  # imported earlier in the chapter; repeated here for completeness

file_name = '0EA27ICBLF_1607.txt'  # an arbitrary file to explore


df = pd.read_csv(
    'data/Tappy Data/' + file_name,
    delimiter=' ',
    index_col=False,
    names=['UserKey', 'Date', 'Timestamp', 'Hand', 'Hold time',
           'Direction', 'Latency time', 'Flight time']
)

df = df.drop('UserKey', axis=1)

print(df.head())
  2. Now, use SciView to inspect this df variable:

We see that this is the same data that we saw in the Working with datasets section of this chapter, now formatted as a Pandas DataFrame. Next, we will apply various preprocessing techniques to this dataset.
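If you prefer a programmatic check over SciView, printing the inferred column dtypes is a quick way to see which columns still need type conversion:

# A quick check of the inferred column types
print(df.dtypes)  # the Date and time columns will not yet have the types we want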

  3. From here, we can also see that we need to convert the datetime columns into their appropriate data types:

#%% Format datetime data

df['Date'] = pd.to_datetime(df['Date'], errors='coerce', format='%y%m%d').dt.date  # note: '%m' is month; '%M' would mean minute

# converting time data to numeric
for column in ['Hold time', 'Latency time', 'Flight time']:
    df[column] = pd.to_numeric(df[column], errors='coerce')

df = df.dropna(axis=0)

print(df.head())
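The errors='coerce' argument deserves a note: instead of raising an exception on malformed values, it replaces them with NaT (for datetimes) or NaN (for numerics), which the dropna() call above then removes. A minimal standalone sketch, using made-up values:

import pandas as pd

s = pd.Series(['190722', 'not-a-date', '190801'])
print(pd.to_datetime(s, errors='coerce', format='%y%m%d'))
# the middle entry becomes NaT instead of raising a ValueError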
  4. Now, each of the Hand and Direction columns has a fixed set of valid values. In particular, each cell in Hand should hold the value L (left), R (right), or S (spacebar), and the data in Direction is one of the nine possibilities going from one of the three values to another (LL, LR, LS, and so on). For this reason, we would like to filter out the rows that don't hold one of these values in the two columns, using the code in the next block (a more compact alternative with isin() is sketched after this step):
#%% Remove incorrect data

# cleaning data in Hand
df = df[
    (df['Hand'] == 'L') |
    (df['Hand'] == 'R') |
    (df['Hand'] == 'S')
]

# cleaning data in Direction
df = df[
    (df['Direction'] == 'LL') |
    (df['Direction'] == 'LR') |
    (df['Direction'] == 'LS') |
    (df['Direction'] == 'RL') |
    (df['Direction'] == 'RR') |
    (df['Direction'] == 'RS') |
    (df['Direction'] == 'SL') |
    (df['Direction'] == 'SR') |
    (df['Direction'] == 'SS')
]

print(df.head())

Note that our current data might not contain any of these invalid values, but it is good practice to keep this filtering logic in our code in case the data is changed or updated in the future, ensuring that our pipeline stays consistent.
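As an aside, the same filtering can be written more compactly with the Series.isin() method; a minimal equivalent sketch:

# A more compact, equivalent filter using isin()
valid_hands = ['L', 'R', 'S']
valid_directions = [h1 + h2 for h1 in valid_hands for h2 in valid_hands]

df = df[df['Hand'].isin(valid_hands) & df['Direction'].isin(valid_directions)]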

  5. Next, recall that what we have been working with so far is typing speed data for a specific patient at a given time. A patient is simply a single data point within our first dataset, and we would like to combine the two datasets, so we need a way to aggregate our current data into a single data point.

    Since we are working with numerical data (typing time), we can take the average (mean) of each time column, grouped by direction, as a way to summarize the data of a given user. We can achieve this with the groupby() function from Pandas in the next code cell:
#%% Group by direction (hand transition)

direction_group_df = df.groupby('Direction').mean(numeric_only=True)  # non-numeric columns (Date, Timestamp, Hand) are excluded
print(direction_group_df)
  6. Let's now inspect this direction_group_df DataFrame in SciView:

As we can see, this DataFrame is divided into rows of different Direction data (LL, LR, LS, and so on), and its columns are the different time-based attributes. This is what we want as a single data point that can be appended to our first dataset.
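To confirm this structure without SciView, you can also print the index and columns directly; a quick check:

print(direction_group_df.index.tolist())    # the direction codes present in this sample file
print(direction_group_df.columns.tolist())  # the three time-based columns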

  7. Now, remember that this dataset was computed from a single file in the Tappy Data folder. However, we need to iterate through all of the files in that folder. To do that, we first refactor all of the data-manipulation logic so far into a function in the next code cell:
#%% Combine into one function

import gc

def read_tappy(file_name):
    df = pd.read_csv(
        'data/Tappy Data/' + file_name,
        delimiter=' ',
        index_col=False,
        names=['UserKey', 'Date', 'Timestamp', 'Hand', 'Hold time',
               'Direction', 'Latency time', 'Flight time']
    )

    df = df.drop('UserKey', axis=1)

    df['Date'] = pd.to_datetime(df['Date'], errors='coerce', format='%y%m%d').dt.date

    # Convert time data to numeric
    for column in ['Hold time', 'Latency time', 'Flight time']:
        df[column] = pd.to_numeric(df[column], errors='coerce')
    df = df.dropna(axis=0)

    # Clean data in `Hand`
    df = df[
        (df['Hand'] == 'L') |
        (df['Hand'] == 'R') |
        (df['Hand'] == 'S')
    ]

    # Clean data in `Direction`
    df = df[
        (df['Direction'] == 'LL') |
        (df['Direction'] == 'LR') |
        (df['Direction'] == 'LS') |
        (df['Direction'] == 'RL') |
        (df['Direction'] == 'RR') |
        (df['Direction'] == 'RS') |
        (df['Direction'] == 'SL') |
        (df['Direction'] == 'SR') |
        (df['Direction'] == 'SS')
    ]

    direction_group_df = df.groupby('Direction').mean(numeric_only=True)
    del df
    gc.collect()

    # Make sure all nine directions appear, even if missing from this file
    direction_group_df = direction_group_df.reindex(
        ['LL', 'LR', 'LS', 'RL', 'RR', 'RS', 'SL', 'SR', 'SS'])
    direction_group_df = direction_group_df.sort_index()  # to ensure correct order of data

    return direction_group_df.values.flatten()  # returning a NumPy array

Specifically, this read_tappy() function takes in a filename in the Tappy Data folder and performs the same processing that we discussed in the previous steps. It returns the aggregated, averaged time data that we saw earlier as a flattened (one-dimensional) NumPy array of 27 values (nine directions times three time columns). This is necessary for us to be able to append it to our first dataset.
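As a quick sanity check (reusing the sample file from earlier), we can verify that the returned array has the expected number of entries:

# A quick sanity check on the sample file
features = read_tappy('0EA27ICBLF_1607.txt')
print(features.shape)  # expected: (27,) - 9 directions x 3 time columns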

  8. Then, we have another function, process_user(), that iterates through all of the files associated with a common patient and calls read_tappy() to process those files:
import numpy as np  # imported earlier in the chapter; repeated here for completeness

def process_user(user_id, filenames):
    running_user_data = np.array([])

    for filename in filenames:
        if user_id in filename:
            running_user_data = np.append(running_user_data,
                                          read_tappy(filename))

    # reshape so that each processed file becomes one row of 27 values
    running_user_data = np.reshape(running_user_data, (-1, 27))

    return np.nanmean(running_user_data, axis=0)  # ignoring NaNs while calculating the mean

In the end, this function returns a summary of all of the time-related data of a specific patient.
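For example, assuming the patient ID of our sample file from earlier ('0EA27ICBLF') appears in the folder listing, a call would look like this:

import os

filenames = os.listdir('data/Tappy Data/')
user_summary = process_user('0EA27ICBLF', filenames)
print(user_summary.shape)  # expected: (27,)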

  9. In the next code cell, we finally iterate through all of the valid patient IDs and call process_user() using a for loop:
#%% Run through all available data

import os  # imported earlier in the chapter; repeated here for completeness
import warnings; warnings.filterwarnings("ignore")  # suppress warnings (for example, np.nanmean on patients with no Tappy files)


filenames = os.listdir('data/Tappy Data/')

column_names = [first_hand + second_hand + '_' + time
                for first_hand in ['L', 'R', 'S']
                for second_hand in ['L', 'R', 'S']
                for time in ['Hold time', 'Latency time', 'Flight time']]

user_tappy_df = pd.DataFrame(columns=column_names)

for user_id in user_df.index:
    user_tappy_data = process_user(str(user_id), filenames)
    user_tappy_df.loc[user_id] = user_tappy_data

# Some preliminary data cleaning
user_tappy_df = user_tappy_df.fillna(0)
user_tappy_df[user_tappy_df < 0] = 0

print(user_tappy_df.head())

As we iterate through the for loop, we call process_user() to obtain the aggregated data and append it to a running DataFrame stored in user_tappy_df.
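The row-appending pattern in the loop relies on the fact that assigning to df.loc with a new index label enlarges the DataFrame by one row; a tiny standalone sketch with made-up values:

import numpy as np
import pandas as pd

demo = pd.DataFrame(columns=['a', 'b', 'c'])
demo.loc['patient_1'] = np.array([1.0, 2.0, 3.0])  # a new label creates a new row
print(demo)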

  10. When the code block finishes executing, let's inspect the resulting user_tappy_df DataFrame in SciView:

As you can see, our current DataFrame is indexed by the unique patient IDs and, at the same time, contains data on their typing speed as individual columns. This is the exact format that we want our data to be in.

This also concludes our discussion of data cleaning and pre-processing methods that are specific to our example datasets. Next, we will finally combine our two datasets and write the result to a file.
