Chapter 1

Introduction to model management and analytics

Bedir TekinerdoganaÖnder BaburbLoek Cleophasb,cMark van den BrandbMehmet Akşitd    aInformation Technology Group, Wageningen University, Wageningen, The Netherlands
bEindhoven University of Technology, Eindhoven, The Netherlands
cStellenbosch University, Matieland, Republic of South Africa
dUniversity of Twente, Computer Science, Formal Methods & Tools Group, Enschede, The Netherlands

Abstract

Model management and analytics (MMA) aims to use models and related artifacts to derive relevant information to support decision making processes of organizations. Various different models as well as analytics approaches could be identified. In addition, MMA systems will have different requirements and as such apply different architecture design configurations. In this chapter we will discuss the key concepts related to MMA by referring also to existing data analytics approaches. We present and discuss the current key data analytics reference architectures in the literature and the key requirements for MMA. Subsequently, we provide the reference architecture of MMA using multiple architecture viewpoints. We illustrate the framework using a real-world example.

Keywords

model-driven engineering; scalability; model analytics; data mining; Machine Learning

1.1 Introduction

The idea of creating business value from data has always been an important concern. Many businesses have extracted information from data to gain new insights and make smarter decisions. Together with the advancements of disruptive technologies such as Cloud Computing and Internet of Things, the ability to capture and store vast amounts of data has grown at an unprecedented rate and soon did not scale with traditional data management techniques. Yet, to cope with the rapidly increasing volume, variety, and velocity of the generated data we can now adopt the available novel technical capacity and the infrastructure to aggregate and analyze huge and variant sets of data. This situation has led to new and unforeseen opportunities for many organizations. Data science, and Big Data in particular, has now become a very important driver for innovation and growth for various industries, such as health, administration, agriculture, and education.

Big Data is usually characterized using four V's, that is, volume, variety, velocity, and veracity. Volume is one the key characteristics of Big Data, indicating the large amount of data that usually does not fit on one computer. Variety relates to the different data types, including structured data, semistructured data, and unstructured data. Velocity refers to the speed at which the data are generated as well as being processed. Finally, veracity refers to the trustworthiness of the data.

With Big Data, different types of data have been addressed, including text, audio, and video. A different type of data that did not get much attention yet is the whole set of models that is used in various engineering and science disciplines. A model is an abstract representation of selected parts of the considered domain. In science, models can relate to, for example, physical and chemical models. In engineering, models can relate to the intermediate artifacts for realizing the eventual system. In software engineering, models are, for example, the UML design artifacts.

In this context, model management and analytics (MMA) aims to use models and related artifacts to derive relevant information to support, in the most general sense, decision making processes of organizations. This can have several benefits, including the understanding, development, maintenance, and management of those artifacts. Various different models as well as analytics approaches could be identified. In this chapter we will borrow from existing domains, potentially to be adapted and exploited for MMA.

1.2 Data analytics concepts

Data analytics is the discovery, interpretation, and communication of meaningful patterns in data. Using data analytics, data are examined in order to gain insight, from which one can make decisions and take actions that lead to effective outcomes. Hence, the goal of analytics is usually to improve the business by gaining knowledge which can be used to make improvements or changes. Traditionally, data analytics predominantly refers to various set of applications, such as basic business intelligence (BI), online analytical processing (OLAP), and various forms of advanced analytics. Data analytics is related to the term business analytics, with the difference that the latter is focused on business uses, while data analytics has a broader focus.

Data analytics is important to support decision making and likewise help organizations better achieve their business goals. Different types of analytics can be distinguished, including:

  • •  Descriptive analytics uses historical data to provide insight into what has happened.
  • •  Diagnostics analytics again uses the historical data to answer why something has happened.
  • •  Predictive analytics elaborates on the insight of the analyzed data and answers what can happen.
  • •  Prescriptive analytics, which usually builds on and uses the other types of analytics, answers the question what to do.

Each of these types of analytics can be understood quite simply as using data to answer different types of questions. In the case of data analytics the source is the traditional data (e.g., text, audio, video, etc.). With model analytics we assume that the analytics takes as input a set of models to support the decision making process.

Model analytics can thus be considered as a subcategory of data analytics orienting particularly on model artifacts.

Based on the observations above, we thus advocate a perspective where model-* (as an umbrella term for model-driven, model-based, and other related approaches) artifacts are treated holistically as data, processed, and analyzed with various scalable and efficient techniques, possibly inspired by related domains. Tackling large volumes of artifacts has been commonplace in other domains, such as text mining for natural language text [1] and repository mining for source code [2]. While we might not be able to apply those techniques as-is on model-* artifacts, the general data analytics workflow appears to be applicable with the steps of data collection, cleaning, integration and transformation, feature engineering and selection, modeling (e.g., as statistical models or neural networks), and finally deployment, exploration, and visualization.

1.3 The inflation of modeling artifacts

The problem of the large number of models is not new in the model-* world. Several efforts have indeed been initiated to store and manage large numbers of models and related artifacts [3,4]. Further efforts include mining public repositories for MDE-related items from GitHub, e.g., Eclipse-based MDE technologies [5] and UML models [6] (the Lindholmen dataset). In the latter, the number of UML models can reach more than 90k. We have previously demonstrated the increasing number of newly created Ecore metamodels, as well as a number of commits on them, on GitHub [7]. A part of the results is given in Fig. 1.1, depicting a strong upward trend for the number of commits and newly created files over the years. It is evident that there are increasingly more Ecore metamodels in GitHub. We have further evidence that other types of modeling artifacts and repositories follow similar trends [8].

Image
Figure 1.1 GitHub results on Ecore metamodels. (A) Number of commits per year. (B) New files per year.

Moreover, it appears that even within a single industry or organization, a similar situation emerges with the increasing adoption of model-* approaches. Just a single company might need to manage an ecosystem consisting of dozens of metamodels, thousands of models, and tens of thousands LOCs of transformations. With the complete revision history, the total number of artifacts can reach tens of thousands. Along with conventional forward engineering approaches, we can observe an increasing trend with legacy software: automated migration into model-driven/-based engineering using process mining and model learning.

All the presented facts, from open source as well as industry, let us confirm the statement by Brambilla et al. [9] and Whittle et al. [10] that the adoption of model-* approaches in (at least some parts of) the industry grows quite rapidly, and we conclude that tackling these artifacts, namely MMA, will be increasingly important in the future.

1.4 Relevant domains for MMA

Despite the different nature of models, as exemplified above, we can be inspired by techniques from other disciplines and try to adapt them for the problems in MMA. As a preliminary overview, in this section we list and discuss several such domains. While there is related model-* research on some of the items on the list, we believe a conscious and integrated mindset would mitigate the challenges for scalable application of model-* approaches.

Descriptive statistics  Several model-* researchers have already performed empirical studies on model-* artifacts with a statistical mindset. For instance, Kolovos et al. assess the use of Eclipse technologies in GitHub, giving related trend analyses [5]. Mengerink et al. present an automated analysis framework on version control systems with similar capabilities [11]. Di Rocco et al. perform a correlation analysis on metrics for various model-* artifacts [12]. Descriptive statistics could in the most general sense be exploited to gain insight in large numbers of model-* artifacts in terms of general characteristics, patterns, outliers, statistical distributions, dependence, etc.

Information retrieval  Techniques from information retrieval (IR) can facilitate indexing, searching, and retrieving of models, and thus their management and reuse. The adoption of IR techniques on source code dates back to the early 2000s, and within the model-* community there has been some recent effort in this direction (e.g., by Bislimovska et al. [13]). Further IR-based techniques can be found in [14,15] involving repository management and model searching scenarios.

Natural language processing  Accurate natural language processing (NLP) is needed to handle realistic models with noisy text content, compound words, and synonymy/polysemy. In our experience, it is very problematic to blindly use NLP tools on models, e.g., just WordNet synonym checking without proper part-of-speech tagging and word sense disambiguation. More research is needed to find the right chain of NLP tools applicable for models (in contrast to source code and documentation), and reporting accuracies and disagreements between tools (along the lines of the recent report in [16] for repository mining). Note that NLP offers further advanced tools, such as language modeling, which are still to be investigated for model-* approaches.

Data mining  Following the perspective of approaching model-* artifacts as data, we need scalable techniques to extract relevant units of information from models (features in data mining jargon) and to discover patterns, including clusters, outliers/noise, and clones. Several example applications can be found in [14,15,17] for domain clustering EMF metamodels, and in [18] for classifying forward vs. reverse engineered UML models. To analyze, explore, and eventually make sense of the large model-* datasets (e.g., the Lindholmen dataset [6]), we can investigate what can be borrowed from comparable approaches in data mining for structured data.

Graph databases and graph-based methods  Given that quite some commonly used models, such as UML, are based on an underlying graph, graph databases can be used to store, query, and reason about models. There has already been some effort using graph databases, such as Neo4EMF [19] as a persistence layer for (potentially very big) models, and Mogwaï [20] as a fast and complex querying mechanism for models. Another related idea is presented by Clarisó and Cabot [21], who advocate using graph kernel-based methods for several model-* tasks such as model searching and clustering.

Machine learning  The increasing availability of large amounts of model-* data can be exploited, via Machine Learning, to automatically infer certain qualities and predictor functions (e.g., performance). There has been a thrust of research in this direction for source code (e.g., for fault prediction [23]), and it would be noteworthy to investigate the emerging needs of the model-* communities and feasibility of such learning techniques for model-* approaches. The approaches in [24] for learning model transformations by examples and [25] for automatic model repair using reinforcement learning are some of the few pieces of such work in model-* approaches.

Visualization  We propose visualization and visual analytics techniques to inspect a whole dataset of artifacts (e.g., cluster visualizations in [15], in contrast with visualizing a single big model in [26]) using various features such as metrics and cross-artifact relationships. The goals could range from exploring a repository to analyzing a model-* ecosystem holistically and even studying the (co-)evolution of model-* artifacts.

Distributed/parallel computing  With the growing amount of data to be processed, employing distributed and parallel algorithms in model-* approaches is very relevant. There are conceptually related approaches worthwhile investigating, e.g., distributed model transformations for very large models [27,28] or model-driven data analytics [29]. Yet we wish to draw attention here to performing computationally heavy data mining or Machine Learning tasks for large model-* datasets in an efficient way.

(Big) data analytics  One of the key domains that is related to and required for managing large collections of models is obviously Big Data. Based on the literature, we have derived the family feature model for Big Data systems, as shown in Fig. 1.2. This has been largely adopted from our earlier work on a feature-based analysis of Big Data systems [22]. The feature diagram contains features representing the essential characteristics or externally visible properties of the system. Features may be mandatory or alternative/optional and may have subfeatures which as such can lead to a hierarchical tree. Besides the conceptual model, MMA can get inspiration from the different Big Data reference architectures in the literature. Notable ones would include the lambda architecture, which is a three-layer architecture aimed at a robust and fault-tolerant system [30], and a functional architecture with six key modules representing phases from data extraction and loading to interfacing and visualization [31]. MMA could be seen from the Big Data perspective. In that case, the MMA approaches can be represented using the features of the family feature diagram of Fig. 1.2.

Image
Figure 1.2 Top-level feature model for Big Data systems (adapted from [22]).

The above represents a nonexhaustive list of domains as a preliminary exploitation guideline for MMA. Although the aforementioned domains themselves are quite mature on their own, it should be investigated to what extent results and approaches can be transferred into the model-* technical space, particularly for MMA. The chapters in this book approach the topic from these and other different perspectives, and likewise provide valuable insight in the MMA domain.

References

[1] A. Hotho, A. Nürnberger, G. Paass, A brief survey of text mining, LDV Forum 2005;20:19–62.

[2] H. Kagdi, M.L. Collard, J.I. Maletic, A survey and taxonomy of approaches for mining software repositories in the context of software evolution, Journal of Software Maintenance and Evolution 2007;19:77–131.

[3] F. Basciani, J.D. Rocco, D.D. Ruscio, A.D. Salle, L. Iovino, A. Pierantonio, Mdeforge: an extensible web-based modeling platform, Proceedings of the 2nd International Workshop on Model-Driven Engineering on and for the Cloud Co-Located with the 17th International Conference on Model Driven Engineering Languages and Systems. CloudMDE@MoDELS 2014, Valencia, Spain, September 30, 2014. 2014:66–75. http://ceur-ws.org/Vol-1242/paper10.pdf.

[4] H. Störrle, R. Hebig, A. Knapp, An index for software engineering models, Joint Proceedings of MODELS 2014 Poster Session and the ACM Student Research Competition (SRC) Co-Located with the 17th International Conference on Model Driven Engineering Languages and Systems. MODELS 2014, Valencia, Spain, September 28 – October 3, 2014. 2014:36–40. http://ceur-ws.org/Vol-1258/poster8.pdf.

[5] D.S. Kolovos, N.D. Matragkas, I. Korkontzelos, S. Ananiadou, R.F. Paige, Assessing the use of eclipse MDE technologies in open-source software projects, OSS4MDE@MoDELS. 2015:20–29.

[6] R. Hebig, T. Ho-Quang, M.R.V. Chaudron, G. Robles, M.A. Fernández, The quest for open source projects that use UML: mining GitHub, Proceedings of the ACM/IEEE 19th International Conference on Model Driven Engineering Languages and Systems. Saint-Malo, France, October 2–7, 2016. 2016:173–183. http://dl.acm.org/citation.cfm?id=2976778.

[7] Ö. Babur, L. Cleophas, M. van den Brand, B. Tekinerdogan, M. Aksit, Models, more models, and then a lot more, M. Seidl, S. Zschaler, eds. Software Technologies: Applications and Foundations. Cham: Springer International Publishing; 2018:129–135.

[8] Ö. Babur, Model analytics and management. [IPA Dissertation Series] Technische Universiteit Eindhoven; 2019.

[9] M. Brambilla, J. Cabot, M. Wimmer, Model-Driven Software Engineering in Practice. 1st edition Morgan & Claypool Publishers; 2012.

[10] J. Whittle, J. Hutchinson, M. Rouncefield, The state of practice in model-driven engineering, IEEE Software 2014;31:79–85.

[11] J.G. Mengerink, A. Serebrenik, R.R. Schiffelers, M.G. van den Brand, Automated analyses of model-driven artifacts: obtaining insights into industrial application of MDE, Proceedings of the 27th International Workshop on Software Measurement and 12th International Conference on Software Process and Product Measurement. ACM; 2017:116–121.

[12] J. Di Rocco, D. Di Ruscio, L. Iovino, A. Pierantonio, Mining metrics for understanding metamodel characteristics, Proceedings of the 6th International Workshop on Modeling in Software Engineering. MiSE 2014. New York, NY, USA: ACM; 2014:55–60 10.1145/2593770.2593774. http://doi.acm.org/10.1145/2593770.2593774.

[13] B. Bislimovska, A. Bozzon, M. Brambilla, P. Fraternali, Textual and content-based search in repositories of web application models, ACM Transactions on the Web (TWEB) 2014;8, 11.

[14] Ö. Babur, L. Cleophas, M. van den Brand, Hierarchical clustering of metamodels for comparative analysis and visualization, Proc. of the 12th European Conf. on Modelling Foundations and Applications. 2016:2–18.

[15] F. Basciani, J. Di Rocco, D. Di Ruscio, L. Iovino, A. Pierantonio, Automated clustering of metamodel repositories, International Conference on Advanced Information Systems Engineering. Springer; 2016:342–358.

[16] F.N.A.A. Omran, C. Treude, Choosing an NLP library for analyzing software documentation: a systematic literature review and a series of experiments, Proceedings of the 14th International Conference on Mining Software Repositories. MSR 2017, Buenos Aires, Argentina, May 20–28, 2017. 2017:187–197 10.1109/MSR.2017.42. https://doi.org/10.1109/MSR.2017.42.

[17] Ö. Babur, Statistical analysis of large sets of models, Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ASE 2016. New York, NY, USA: ACM; 2016:888–891.

[18] M.H. Osman, T. Ho-Quang, M. Chaudron, An automated approach for classifying reverse-engineered and forward-engineered UML class diagrams, 2018 44th Euromicro Conference on Software Engineering and Advanced Applications. SEAA. IEEE; 2018:396–399.

[19] A. Benelallam, A. Gómez, G. Sunyé, M. Tisi, D. Launay, Neo4EMF, a scalable persistence layer for EMF models, European Conference on Modelling Foundations and Applications. Springer; 2014:230–241.

[20] G. Daniel, G. Sunyé, J. Cabot, Mogwaï: a framework to handle complex queries on large models, Research Challenges in Information Science (RCIS), 2016 IEEE Tenth International Conference on. IEEE; 2016:1–12.

[21] R. Clarisó, J. Cabot, Applying graph kernels to model-driven engineering problems, Proceedings of the 1st International Workshop on Machine Learning and Software Engineering in Symbiosis. MASES 2018. 2018:1–5.

[22] C.A. Salma, B. Tekinerdogan, I.N. Athanasiadis, Feature driven survey of big data systems, IoTBD. 2016:348–355.

[23] C. Catal, B. Diri, A systematic review of software fault prediction studies, Expert Systems with Applications 2009;36:7346–7354.

[24] I. Baki, H.A. Sahraoui, Multi-step learning and adaptive search for learning complex model transformations from examples, ACM Transactions on Software Engineering and Methodology 2016;25, 20.

[25] A. Barriga, A. Rutle, R. Heldal, Automatic model repair using reinforcement learning, Proc. of MODELS 2018 Workshops, Co-Located with ACM/IEEE 21st Int. Conf. on Model Driven Engineering Languages and Systems. MODELS 2018, Copenhagen, Denmark, October, 14, 2018. 2018:781–786.

[26] D.S. Kolovos, L.M. Rose, N. Matragkas, R.F. Paige, E. Guerra, J.S. Cuadrado, J. De Lara, I. Ráth, D. Varró, M. Tisi, J. Cabot, A research roadmap towards achieving scalability in model driven engineering, Proceedings of the Workshop on Scalability in Model Driven Engineering. BigMDE '13. New York, NY, USA: ACM; 2013, 2 10.1145/2487766.2487768. http://doi.acm.org/10.1145/2487766.2487768.

[27] A. Benelallam, A. Gómez, M. Tisi, J. Cabot, Distributed model-to-model transformation with ATL on MapReduce, Proceedings of the 2015 ACM SIGPLAN International Conference on Software Language Engineering. ACM; 2015:37–48.

[28] L. Burgueño, M. Wimmer, A. Vallecillo, Towards distributed model transformations with LinTra, 2016.

[29] T. Hartmann, A. Moawad, F. Fouquet, G. Nain, J. Klein, Y.L. Traon, J.-M. Jezequel, Model-driven analytics: connecting data, domain knowledge, and learning, preprint arXiv:1704.01320; 2017.

[30] N. Marz, J. Warren, Big Data: Principles and Best Practices of Scalable Real-Time Data Systems. New York: Manning Publications Co.; 2015.

[31] D. Chappelle, Big data & analytics reference architecture. [An Oracle White Paper] 2013.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.15.229.88