Structure your unstructured data first!

The case of summarizing unstructured data with tag clouds

A. Bacchelli, Delft University of Technology, Delft, The Netherlands

Abstract

Unstructured software data, such as emails and discussions in technical forums, are a rich source of information about software systems. Nevertheless, mining this form of data is hard, as it comprises different languages that cannot be processed with the same techniques.

In this chapter, we show how we can summarize unstructured software data by first giving it the structure it needs.

Keywords

Unstructured software data; Natural language; Source code; Unstructured software data summarization; Tag cloud; Classification; Source code detection approach; Parse

Unstructured Data in Software Engineering

Anyone who has never worked on a real software project might mistakenly believe that software engineers spend all their time reading and writing source code.

But software engineers do much more than just read and write code. Their day-to-day reality is that they spend much time writing a wide range of material—little of which is source code (we see this happening even in open-source software projects, where there is not really a paper-driven or manager-mandated development process). Accordingly, it is very important to discuss methods for handling data that is not source code, particularly unstructured software data: data written mostly in natural language to exchange information with other people.

Look, for example, at Fig. 1. It represents the volume of emails exchanged monthly in the Linux kernel mailing list, from Jun. 1995 to Mar. 2010. With a constantly growing trend, we see that in the last month we consider, developers exchanged 13,657 emails, or 440 emails per day.

Fig. 1 Number of emails exchanged in the development mailing list of the Linux Kernel, by month.

And emails are just one of the types of documents that software engineers produce and read daily. They also write and read issue reports, design documents, commit messages, code review messages, etc. All of these form the so-called unstructured software data, i.e., data written in natural language by people for other people (as opposed to source code, which is written for a computer, or log messages, which are generated by a computer for a human).

Summarizing Unstructured Software Data

With that much information available, it is doubtful that engineers have the chance to read all the documents and get the right information out of them. For this reason, researchers in software engineering looked into various techniques to summarize, or aggregate, the various types of information available in unstructured software data.

As Simple as Possible… But not Simpler!

The first reasonable approach to summarizing unstructured software data is to tap into the methods devised by the information retrieval community, whose goal is precisely to retrieve information from natural language documents, and to combine them with some basic visualization techniques.

So, let us say we want to summarize the email information produced in just a single discussion thread on the mailing list of an open-source software project. A common solution would be to collect all the terms present in those emails and display them in a tag cloud [1], where the most frequently occurring terms are represented in larger fonts and the less frequently occurring ones in smaller fonts. The more visible a term is in the tag cloud, the greater its semantic value should be. Terms are defined as single words, separated by spaces, punctuation, or some special characters.
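The counting step behind such a tag cloud can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions: the function names, the tokenization rule (runs of letters, digits, and underscores), and the linear mapping from counts to font sizes between 10 and 40 points are hypothetical choices, not part of any specific tag cloud tool.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Split text into lowercase terms and count how often each occurs."""
    # Assumption: a term is a run of letters, digits, or underscores;
    # spaces, punctuation, and other special characters act as separators.
    terms = re.findall(r"[A-Za-z0-9_]+", text.lower())
    return Counter(terms)


def font_sizes(frequencies, min_size=10, max_size=40):
    """Map each term's count linearly onto a font size (hypothetical scale)."""
    if not frequencies:
        return {}
    lowest = min(frequencies.values())
    highest = max(frequencies.values())
    spread = highest - lowest
    sizes = {}
    for term, count in frequencies.items():
        # If all terms occur equally often, every term gets the minimum size.
        scale = (count - lowest) / spread if spread else 0.0
        sizes[term] = round(min_size + scale * (max_size - min_size))
    return sizes
```

Calling `font_sizes(term_frequencies(body))` on the raw body of an email gives every term a size, including terms from signatures and code, which is exactly why the unprocessed cloud is noisy.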

Let us take the content of the following email in Fig. 2, adapted from a real email sent to the mailing list of an open-source system (ArgoUML).

Fig. 2 A development email sent to the development mailing list of an open-source software system.

If we had to summarize the content, we could say that there are three people involved and the content refers to a bug affecting OS X, which is solved by patching the class “Explorer.” How well would a tag cloud represent this threaded email? If we create a tag cloud directly from it, we obtain the following (see Fig. 3).

Fig. 3 A tag cloud directly generated from a development email, without pre-processing.

The tag cloud already gives us some interesting information, but it also contains a lot of noise, which reduces the visibility of what really matters. Apparently, taking “off-the-shelf” methods from information retrieval and data visualization is quick and simple, but it is probably a “too simple” solution, not good enough to solve our problem.

You Need Structure!

Why did the preceding method not work well? Because most off-the-shelf techniques are designed to work on a very specific kind of data. If we took newspaper articles and followed the same approach, we would probably obtain more interesting results. But documents generated by software engineers are substantially different from those generated by writing professionals, such as newspaper journalists. Developers might use jargon more often and write more tersely, and much implicit knowledge goes unexpressed. In addition, emails contain text that is not relevant for analyzing their content, such as authors' signatures. Finally, and perhaps most importantly, software engineers often mix languages: natural language is mixed with source code, stack traces, and patches.

If we really want to extract useful information, which can be summarized, from unstructured software data, we first have to find the latent structure it has and exploit it correctly.

How can we give structure to a wall of text, such as an email? We can first recognize that there is a simple underlying structure that we can, and should, take advantage of and build on. Threaded emails already give us information about how many people are involved in the discussion (the leading “>”s tell us the quotation level). We can also exploit a structure that is already there: authors divide their emails into lines, and our previous research showed that lines are a very good starting point for finding the structure of a message [2]. Considering these two aspects, we obtain the following version of the same email (see Fig. 4), which visually already gives us much more information (this is what every good email client would do).

Fig. 4 A development email with line numbers and information about quotation levels.
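Recovering this basic structure takes little code. The following Python sketch numbers the lines of an email body and counts the leading “>” markers on each one. The function names are hypothetical, and we assume the common plain-text convention in which each quotation level adds one “>”, optionally separated by spaces.

```python
import re

# Assumption: quotation markers are leading '>' characters, possibly
# separated by whitespace, as in "> > quoted text" or ">> quoted text".
QUOTE_PREFIX = re.compile(r"^(?:\s*>)+")


def quotation_level(line):
    """Return (level, text), where level counts the leading '>' markers."""
    match = QUOTE_PREFIX.match(line)
    if match is None:
        return 0, line
    level = match.group(0).count(">")
    # Drop the markers and the single space that usually follows them.
    text = line[match.end():]
    if text.startswith(" "):
        text = text[1:]
    return level, text


def annotate(email_body):
    """Return (line_number, level, text) for each line, numbering from 1."""
    return [
        (number, *quotation_level(line))
        for number, line in enumerate(email_body.splitlines(), start=1)
    ]
```

Each distinct level in the output corresponds to one layer of quoting in the thread, and the stripped text is what the later classification step would inspect.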

Once we have this basic structure, the next step is to realize that, in most cases, the different languages in an email appear on different lines. This reduces our problem to a classification problem: how can we assign each line to the language it belongs to? From our email, we would like to have a categorization that looks like the following (see Fig. 5).

Fig. 5 A development email in which the different "languages" used are visibly recognized.

Is it possible to automatically tag each line with the language it is written in? Yes! Researchers have developed a number of methods to classify text into different categories. A classical example is classifying a whole email as “spam” or legitimate. In the case of development emails, we developed simple methods to distinguish lines of code from other text [2] and more complex ones to distinguish more complex languages, such as those used in signatures, from natural language content [3]. As an example, here we describe how the lines of Java source code can be recognized in the content of an email with a very simple, yet effective, approach, and see what its impact is on a final tag cloud summary.

If we consider lines 17–20 and 25–28, we note a peculiarity present in many programming languages (e.g., Java, C, C#, Perl): the developer must end each statement with a semicolon or a curly bracket (mainly used to open or close a block). Based on this intuition, a simple approach that verifies whether the last character of a line is a semicolon or a curly bracket might be a good way to detect source code. In addition, we can write a simple regular expression to recognize the lines of a stack trace (9–11), as they have a very clear structure, without any nested blocks.
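A minimal Python sketch of this line classifier follows. The end-of-line check is the heuristic described above; the regular expression for stack traces is our own assumption about what a Java stack frame looks like (for example, `at org.example.Foo.bar(Foo.java:42)`), since the exact expression is not given here. The function name and the category labels are hypothetical.

```python
import re

# Assumed shape of a Java stack-trace frame, e.g.
# "at org.example.Foo.bar(Foo.java:42)". This pattern is an illustration,
# not the expression used in the cited work.
STACK_FRAME = re.compile(
    r"^\s*at\s+[\w.$<>]+"
    r"\((?:[\w$]+\.java:\d+|Native Method|Unknown Source)\)\s*$"
)


def classify_line(line):
    """Label a line as 'stack_trace', 'source_code', or 'other'."""
    stripped = line.rstrip()
    if STACK_FRAME.match(stripped):
        return "stack_trace"
    # Heuristic from the text: a statement ends with a semicolon and a
    # block is opened or closed with a curly bracket.
    if stripped.endswith((";", "{", "}")):
        return "source_code"
    return "other"
```

The stack-trace check runs first because a frame never ends in a semicolon or bracket, so the two rules do not compete. Natural language sentences that happen to end with a semicolon would be misclassified as code, which is the price of such a lightweight heuristic.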

We tested this approach on thousands of emails from five Java systems and found that it works well in practice and can be used as a basis for further data analysis on emails [2].

Now, if we take our initial email, remove the signatures (by, for example, eliminating everything at the end after the dashes), and apply the source code detection approach, we can generate a new tag cloud, such as the one depicted below (see Fig. 6). We now see that more important terms (such as “Explorer” and “NullPointerException”) start to emerge, thus creating a summary that gives a much better idea of the content.

Fig. 6 A tag cloud generated from preprocessed email content.
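The preprocessing can be sketched as a small Python pipeline. This is an illustration under several assumptions of ours: a signature is taken to start at the last line that consists only of two or more dashes, lines detected as source code are simply left out of the term count, and all function names are hypothetical. Other choices, such as counting code terms separately, are equally plausible.

```python
import re
from collections import Counter

# Assumption: the signature starts at a line made only of two or more dashes.
SIGNATURE_DELIMITER = re.compile(r"^-{2,}\s*$")


def strip_signature(lines):
    """Drop the last dashes-only line and everything after it."""
    for index in range(len(lines) - 1, -1, -1):
        if SIGNATURE_DELIMITER.match(lines[index]):
            return lines[:index]
    return lines


def looks_like_code(line):
    """End-of-line heuristic: statements end with ';' and blocks with '{' or '}'."""
    return line.rstrip().endswith((";", "{", "}"))


def summary_terms(email_body):
    """Count the terms of an email after removing signature and code lines."""
    lines = strip_signature(email_body.splitlines())
    # Assumption: detected code lines are left out of the summary entirely.
    prose = [line for line in lines if not looks_like_code(line)]
    return Counter(re.findall(r"[A-Za-z0-9_]+", " ".join(prose).lower()))
```

Note that a dashes-only line can also occur inside a patch, so on emails that contain patches this delimiter rule may cut too much; a more careful implementation would classify patch lines before looking for the signature.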

With more sophisticated approaches [3], we can even parse the different parts of the content of a development email and remove more noise, thus making the most important content emerge and creating better summaries.

Conclusion

In this brief chapter, we pointed out that developers write not only source code but also many other artifacts, such as issue reports, design documents, and emails.

We showed that the amount of these artifacts can be overwhelming (as in the case of development emails for the Linux mailing list), so we need some way to summarize the information they contain for a faster, yet still useful, consumption. We made the case that, unfortunately, the best candidate techniques from information retrieval do not work well in the case of software engineering documents, because unstructured software data has a very special language. In particular, software engineers often mix many languages in the same document: natural language, source code, stack traces, etc.

For this reason, we presented the steps that one can take to transform the very unstructured content of an email into something that can be analyzed to generate a valid summary. These steps involve recognizing the latent structure of documents and the languages in which they are composed. Surprisingly, a simple approach is able to detect most of these languages reasonably well and to remove them from the text if necessary.

As a result, we were able to obtain a much more informative tag cloud that summarizes the content of a threaded email, thus giving ideas on how an approach can be developed to analyze and summarize similar unstructured software documents.

References

[1] Tag cloud. Wikipedia, the free encyclopedia. Retrieved from https://en.wikipedia.org/w/index.php?title=Tag_cloud&oldid=678980330 [accessed 01.09.15]

[2] Bacchelli A., D'Ambros M., Lanza M. Extracting source code from e-mails. In: Proc. 18th IEEE international conference on program comprehension (ICPC 2010); 2010:24–33.

[3] Bacchelli A., Dal Sasso T., D'Ambros M., Lanza M. Content classification of development emails. In: Proc. 34th international conference on software engineering (ICSE 2012); 2012:375–385.
