Data Science in Software Development

a course for engineers

At least 50% of development time is typically spent figuring out the system in order to decide what to do next. In other words, software engineering is primarily a decision-making business. Add to that the fact that systems often contain millions of lines of code and even more data, and you get an environment in which decisions have to be made quickly about lots of ever-moving data. How do you approach this challenge effectively?

Description

Developers are data scientists. Or at least, they should be.

Yet, too often, developers drill into the sea of data manually with only rudimentary tool support. Yes, rudimentary. Syntax highlighting and basic code navigation are nice, but they only help when looking into fine details. This approach does not scale for understanding larger pieces, and it should not be perpetuated.

This might sound as if it is not for everyone, but consider this: when a developer sets out to figure out something in a database with a million rows, she will write a query first; yet, when the same developer sets out to figure out something in a system with a million lines of code, she will start reading. Why are these similar problems approached so differently: once with tools and once through manual inspection? And if reading is such a great tool, why do we even consider queries at all? The root problem does not come from the basic skills. They exist already. The main problem is the perception of what software engineering is, and of what engineering tools should be made of.
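To make the contrast concrete, here is a minimal sketch of what "querying code" can look like. It is not part of the course material: it assumes a Pharo playground and uses only the image's own reflection API rather than Moose, and the question it asks (which classes define the most methods) is just an illustrative one.

```smalltalk
"Assumption: evaluated in a Pharo playground, using plain reflection, not Moose."
| classes |
classes := Smalltalk globals allClasses.

"Query: the ten classes that define the most methods, largest first."
(classes sorted: [:a :b | a methods size >= b methods size]) first: 10
```

Answering the same question by reading would mean opening every class in the system; the query treats the code as data and returns a short list to inspect.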

We go through live examples of how software engineering decisions can be made quickly and accurately by building custom analysis tools that enable browsing, visualizing or measuring code and data. Once this door is open you will notice how software development changes. Dramatically.

In this course, you will get to create such custom analyses hands-on using Moose, a uniform and compact platform for creating new analyses. First, you will see how cool assessing systems can be. Second, seeing how diverse use cases can be supported by a small set of tools will challenge the default reflex of relying only on code reading.

Examples

Moose is a cool open-source platform for software and data analysis. Why cool? Because it lets you build all sorts of custom analyses very fast. Often minutes fast. Think of it as R with a highly interactive environment that is also specialized for software systems.

Let’s pick a couple of examples. Here is how you find all classes annotated with @Service that are being called from classes that have ‘ui’ in the qualified name:

model allClasses select: [:each |
  (each isAnnotatedWith: 'Service') and: [
    each clientTypes anySatisfy: [:c | '*ui*' match: c mooseName ]]]

And here is how you visualize the dependencies between a group of methods:

view nodes: methods.
view edges connectFromAll: #clientMethods.
view layout force

Rather concise. Combine this with a live environment and you can change the way you perceive legacy code.

Target audience

The course is intended for software engineers and architects who want to experience, hands-on, a different way of reasoning about software systems.

Course outline

Discussion of the code reading problem

Introduction to the Moose environment

  • The basic architecture
  • The interactive interfaces

Browsing and querying data to extract useful information:

  • Learn about the Moose query APIs
  • XML files
  • JSON files
  • Data in a database
  • Java source code

Growing architecture

Architecture and technical debt:

  • Architecture and quality
  • The benefits and limitations of the technical debt metaphor
  • Beyond technical debt: software habitability as a positive metaphor

Parsing text to extract patterns:

  • Learn visualization scripting APIs
  • Build charts to exhibit patterns in log files
  • Build graphs
  • Integrate visualizations in interactive tools

Play through decision-making scenarios without reading code

Standard duration: 2 days
