Parsing binary ITCH messages

The ITCH v5.0 specification declares over 20 message types related to system events, stock characteristics, the placement and modification of limit orders, and trade execution. It also contains information about the net order imbalance ahead of the opening and closing crosses.

Nasdaq offers samples of daily binary files for several months. The GitHub repository for this chapter contains a notebook, build_order_book.ipynb, that illustrates how to parse a sample file of ITCH messages and reconstruct both the executed trades and the order book for any given tick.

The following table shows the frequency of the most common message types in the sample file for March 29, 2018:

| Message type | Order book impact | Number of messages |
|---|---|---|
| A | New unattributed limit order | 136,522,761 |
| D | Order canceled | 133,811,007 |
| U | Order canceled and replaced | 21,941,015 |
| E | Full or partial execution; possibly multiple messages for the same original order | 6,687,379 |
| X | Modified after partial cancellation | 5,088,959 |
| F | Add attributed order | 2,718,602 |
| P | Trade Message (non-cross) | 1,120,861 |
| C | Executed in whole or in part at a price different from the initial display price | 157,442 |
| Q | Cross Trade Message | 17,233 |

For each message, the specification lays out the components and their respective length and data types:

| Name | Offset | Length | Value | Notes |
|---|---|---|---|---|
| Message type | 0 | 1 | F | Add Order MPID attribution message |
| Stock locate | 1 | 2 | Integer | Locate code identifying the security |
| Tracking number | 3 | 2 | Integer | Nasdaq internal tracking number |
| Timestamp | 5 | 6 | Integer | Nanoseconds since midnight |
| Order reference number | 11 | 8 | Integer | Unique reference number of the new order |
| Buy/sell indicator | 19 | 1 | Alpha | The type of order: B = Buy Order, S = Sell Order |
| Shares | 20 | 4 | Integer | Number of shares for the order being added to the book |
| Stock | 24 | 8 | Alpha | Stock symbol, right-padded with spaces |
| Price | 32 | 4 | Price (4) | The display price of the new order |
| Attribution | 36 | 4 | Alpha | Nasdaq Market participant identifier associated with the order |

Python's struct module parses binary data using format strings that describe the length and type of each component of the byte string, as laid out in the specification.
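As a minimal sketch of how a format string maps onto the Add Order MPID attribution layout shown above, the following example packs a made-up message by hand and unpacks it again. The field values (locate code, order reference, symbol, price, and MPID) are hypothetical and chosen only so the example is self-contained; the format string is derived from the offsets and lengths in the table.

```python
import struct

# One format character per field of the 'F' message, big-endian ('>'):
# type(s) locate(H) tracking(H) timestamp(6s) order ref(Q) side(s)
# shares(I) stock(8s) price(I) attribution(4s)
F_FORMAT = '>sHH6sQsI8sI4s'

# Hypothetical message content, not taken from the sample file
raw = struct.pack(
    F_FORMAT,
    b'F',                                     # message type
    1234,                                     # stock locate
    0,                                        # tracking number
    (34_200_000_000_000).to_bytes(6, 'big'),  # 09:30:00 as ns since midnight
    987654321,                                # order reference number
    b'B',                                     # buy/sell indicator
    100,                                      # shares
    b'AAPL    ',                              # symbol, right-padded to 8 bytes
    1_675_000,                                # price as an unsigned integer
    b'GSCO',                                  # attribution (MPID)
)

# The packed size matches the table: last offset 36 + length 4 = 40 bytes
assert struct.calcsize(F_FORMAT) == len(raw) == 40

fields = struct.unpack(F_FORMAT, raw)
print(fields)
```

Note that struct has no format character for a six-byte integer, so the timestamp comes back as a raw byte string that still needs to be converted.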

Let's walk through the critical steps to parse the trading messages and reconstruct the order book:

  1. The ITCH parser relies on the message specifications provided as a .csv file (created by create_message_spec.py) and assembles format strings according to the formats dictionary:

    formats = {
        ('integer', 2): 'H',   # int of length 2 => format string 'H'
        ('integer', 4): 'I',
        ('integer', 6): '6s',  # int of length 6 => parse as string, convert later
        ('integer', 8): 'Q',
        ('alpha', 1)  : 's',
        ('alpha', 2)  : '2s',
        ('alpha', 4)  : '4s',
        ('alpha', 8)  : '8s',
        ('price_4', 4): 'I',
        ('price_8', 8): 'Q',
    }
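The "convert later" step for six-byte integers, and the scaling of integer prices, can be sketched as follows. The helper names to_int and to_price are hypothetical, and the sketch assumes that a Price (4) field carries four implied decimal places, as its name suggests.

```python
from datetime import timedelta


def to_int(raw: bytes) -> int:
    """Convert a big-endian byte string (e.g. a 6-byte timestamp) to an int."""
    return int.from_bytes(raw, byteorder='big', signed=False)


def to_price(raw_price: int, decimals: int = 4) -> float:
    """Scale an integer price by the assumed number of implied decimals."""
    return raw_price / 10 ** decimals


# Hypothetical raw values as they would come out of struct.unpack
raw_timestamp = (34_200_000_000_000).to_bytes(6, 'big')
raw_price = 1_675_000

nanoseconds = to_int(raw_timestamp)
time_of_day = timedelta(microseconds=nanoseconds // 1000)
price = to_price(raw_price)

print(time_of_day, price)
```

Keeping prices as integers until the final step avoids floating-point rounding while messages are being matched and aggregated.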
  2. The parser translates the message specs into format strings and namedtuples that capture the message content:

    from collections import namedtuple
    import pandas as pd

    # Get ITCH specs and create formatting (type, length) tuples
    specs = pd.read_csv('message_types.csv')
    specs['formats'] = specs[['value', 'length']].apply(tuple, axis=1).map(formats)

    # Formatting for alpha fields
    alpha_fields = specs[specs.value == 'alpha'].set_index('name')
    alpha_msgs = alpha_fields.groupby('message_type')
    alpha_formats = {k: v.to_dict() for k, v in alpha_msgs.formats}
    alpha_length = {k: v.add(5).to_dict() for k, v in alpha_msgs.length}

    # Generate message classes as named tuples and format strings
    message_fields, fstring = {}, {}
    for t, message in specs.groupby('message_type'):
        message_fields[t] = namedtuple(typename=t, field_names=message.name.tolist())
        # '>' selects big-endian byte order for every field in the message
        fstring[t] = '>' + ''.join(message.formats.tolist())
  3. Fields of the alpha type require post-processing as defined in the format_alpha function:

    def format_alpha(mtype, data):
        """Decode the byte strings of alpha fields for one message type."""
        for col in alpha_formats.get(mtype).keys():
            # The stock name is only kept for the summary message 'R'
            if mtype != 'R' and col == 'stock':
                data = data.drop(col, axis=1)
                continue
            data.loc[:, col] = data.loc[:, col].str.decode("utf-8").str.strip()
            # encoding is a dict (defined elsewhere in the notebook) that maps
            # alpha codes to integers for selected columns
            if encoding.get(col):
                data.loc[:, col] = data.loc[:, col].map(encoding.get(col))
        return data

  4. The binary file for a single day contains over 300,000,000 messages totaling over 9 GB. The script appends the parsed result iteratively to a file in the fast HDF5 format to avoid memory constraints (see the last section of this chapter for more on this format). The following (simplified) code processes the binary file and produces the parsed orders stored by message type:

    # unpack is imported from the struct module
    with (data_path / file_name).open('rb') as data:
        while True:
            # Each message is preceded by its size in bytes
            message_size = int.from_bytes(data.read(2), byteorder='big', signed=False)

            # The first byte of the message identifies its type
            message_type = data.read(1).decode('ascii')
            message_type_counter.update([message_type])

            # Read the remainder of the message and store it as a namedtuple
            record = data.read(message_size - 1)
            message = message_fields[message_type]._make(
                unpack(fstring[message_type], record))
            messages[message_type].append(message)

            # Deal with system events like market open/close
            if message_type == 'S':
                timestamp = int.from_bytes(message.timestamp, byteorder='big')
                if message.event_code.decode('ascii') == 'C':  # close
                    store_messages(messages)
                    break
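The length-prefixed framing used in this loop can be tried out on a small in-memory stream. The following is a minimal sketch rather than the notebook's implementation: the helpers frame and iter_messages are hypothetical, the two payloads are placeholders filled with zero bytes, and the loop stops at the end of the stream instead of waiting for a market-close event.

```python
import struct
from collections import Counter
from io import BytesIO


def frame(payload: bytes) -> bytes:
    """Prefix a message with its two-byte big-endian length."""
    return struct.pack('>H', len(payload)) + payload


def iter_messages(stream):
    """Yield (message_type, body) pairs until the stream is exhausted."""
    while True:
        header = stream.read(2)
        if len(header) < 2:        # end of stream or truncated header
            return
        size = int.from_bytes(header, byteorder='big', signed=False)
        payload = stream.read(size)
        if len(payload) < size:    # truncated final message
            return
        yield payload[:1].decode('ascii'), payload[1:]


# Placeholder payloads: a 12-byte system event and a 36-byte add order
stream = BytesIO(
    frame(b'S' + b'\x00' * 10 + b'O') +
    frame(b'A' + b'\x00' * 35)
)

counts = Counter(mtype for mtype, _ in iter_messages(stream))
print(counts)
```

Checking the number of bytes actually read guards against a file that ends in the middle of a message, which the simplified loop above does not handle.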
  5. As expected, a small number of the over 8,500 equity securities traded on this day account for most of the traded value:

    import matplotlib.pyplot as plt
    import pandas as pd
    from matplotlib.ticker import FuncFormatter

    with pd.HDFStore(hdf_store) as store:
        stocks = store['R'].loc[:, ['stock_locate', 'stock']]
        # Combine regular (P) and cross (Q) trades, then attach the symbol.
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
        crosses = store['Q'].rename(columns={'cross_price': 'price'})
        trades = pd.concat([store['P'], crosses], sort=False).merge(stocks)

    trades['value'] = trades.shares.mul(trades.price)
    trades['value_share'] = trades.value.div(trades.value.sum())
    trade_summary = (trades.groupby('stock').value_share.sum()
                     .sort_values(ascending=False))
    trade_summary.iloc[:50].plot.bar(figsize=(14, 6), color='darkblue',
                                     title='% of Traded Value')
    plt.gca().yaxis.set_major_formatter(
        FuncFormatter(lambda y, _: '{:.0%}'.format(y)))

The resulting bar chart shows the share of total traded value for each of the 50 most heavily traded securities.
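The concentration of traded value can also be summarized numerically. The following sketch uses a handful of made-up trades (the symbols, share counts, and prices are purely illustrative) and plain Python to rank securities by their share of traded value and to count how many are needed to reach 90 percent of the total.

```python
from collections import defaultdict
from itertools import accumulate

# Made-up trades: (symbol, shares, integer price)
trades = [
    ('AAA', 500, 1_000_000),
    ('AAA', 300, 1_000_000),
    ('BBB', 100, 500_000),
    ('CCC', 50, 200_000),
    ('DDD', 50, 200_000),
]

# Traded value per security
value = defaultdict(int)
for stock, shares, price in trades:
    value[stock] += shares * price

# Share of total traded value, largest first
total = sum(value.values())
ranked = sorted(value.items(), key=lambda kv: kv[1], reverse=True)
value_share = [(stock, v / total) for stock, v in ranked]

# Number of securities needed to cover at least 90% of traded value
cumulative = list(accumulate(share for _, share in value_share))
n_for_90 = next(i for i, c in enumerate(cumulative, start=1) if c >= 0.9)

print(value_share[0][0], n_for_90)
```

Because only ratios are computed, the integer prices do not need to be rescaled to their decimal values first.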
