An Interview with Wavo's Data Science Team

The music industry has a new appreciation for data. But what exactly are Data Science, Data Engineering, and Machine Learning? And how do these buzzing fields help artists and their teams unlock new value?

To get answers, we spoke with three members of our Data Science team here at Wavo: Arlen Stalwick, Engineering Manager and Software Architect; Monika Sharma, Data Engineer; and Andrew Kam, Software Engineer.


For starters, what is Data Science?

Arlen: “In one sentence, I'd say that the role of a data scientist is to interpret and extract meaning from data, typically very large data sets. Over the past decade or two, the volume of data being generated and collected has exploded, and along with that explosion there has been a realization that understanding, interpreting, and extracting value from the data is both desirable and quite hard. So, you can think of data science as being a sort of collision point between statistics, software engineering, and academic research into algorithms, mathematics and machine learning. Data scientists use visualizations, statistical techniques, and algorithms to explore large data sets and derive meaning from them.”

Monika: “Data science, to be brief, applies analytical algorithms to data to improve decision-making. This requires a good understanding of the analytical algorithms and data being used, as well as an extensive study of the results obtained in order to transfer insights into a business context.”

 
Wavo’s Data Mining Process, the cross-industry standard CRISP-DM.


 

What is Data Engineering? What’s the difference?

Arlen: “In any organization that is actively and intentionally generating and collecting data, you're probably going to have some form of data engineering going on, whether it's called that or not. Data engineering is basically the development and implementation of tools, processes, and infrastructure to make collecting and processing large volumes of data efficient and reliable. At Wavo, we do a lot of data engineering work. The data that we collect comes from a variety of sources, so we work hard to make sure that all of that data is collected and stored in a consistent, reliable and usable way. We do this because we've learned that one of the best ways to make your data scientists (and analysts and users-of-data) more effective is to invest heavily in smart data engineers and good data engineering practices.”

Monika: “Data engineering is the younger sibling of data science. It organizes and prepares the data for data science and other business purposes. This includes building robust ETL data pipelines to process raw data collected from various data sources, designing the data architecture, ensuring that the architecture scales, and implementing processes to validate data quality and reliability.

Although some responsibilities of data scientists and data engineers overlap, they are different. The data scientist’s main job is to use analytical algorithms to extract valuable insights from data, whereas the data engineer is more concerned with making that data organized, reliable, and ready to use.”
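To make the ETL idea Monika mentions a little more concrete, here is a minimal Python sketch of an extract, transform, validate, and load flow. The field names (`track`, `streams`, `source`), the sample rows, and the validation rule are all invented for illustration. They are assumptions, not a description of Wavo's actual pipelines.

```python
# Hypothetical raw rows, as they might arrive from different sources.
RAW_ROWS = [
    {"track": " Song A ", "streams": "1200", "source": "platform_x"},
    {"track": "Song B", "streams": "n/a", "source": "platform_y"},
    {"track": "", "streams": "50", "source": "platform_x"},
]


def extract():
    """Extract: return a copy of the raw rows (stands in for reading from a source)."""
    return list(RAW_ROWS)


def transform(row):
    """Transform: trim whitespace and convert the stream count to an integer."""
    return {
        "track": row["track"].strip(),
        "streams": int(row["streams"]),  # raises ValueError for values like "n/a"
        "source": row["source"],
    }


def is_valid(record):
    """Validate: require a non-empty track name and a non-negative stream count."""
    return bool(record["track"]) and record["streams"] >= 0


def run_pipeline(rows):
    """Load clean records into one list and keep rejected rows for inspection."""
    loaded, rejected = [], []
    for row in rows:
        try:
            record = transform(row)
        except (KeyError, ValueError):
            rejected.append(row)
            continue
        if is_valid(record):
            loaded.append(record)
        else:
            rejected.append(row)
    return loaded, rejected


loaded, rejected = run_pipeline(extract())
```

Keeping the rejected rows instead of silently dropping them is one simple way to make data quality visible, which is the kind of validation step the answer above refers to.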

What is Machine Learning?

Arlen: “There are a lot of names in the computer industry that are bad (‘Artificial Intelligence’, for example), but Machine Learning is one of the good ones, because it's pretty accurate. Machine learning is an active area of research and applied work in computer science that is oriented around getting computers to learn how to do something without explicitly programming the behavior. A good example might be teaching a computer to distinguish between cats and dogs.

Instead of writing code to find the eyes, measure the ears, or detect the fur pattern, with machine learning, you set up a basic learning algorithm (that would work for any photos) and then you hand it a long list of photos labeled with 'cat' or 'dog'. When the computer guesses 'cat' for a dog, the machine learning model is automatically adjusted slightly to de-emphasize whatever caused it to guess 'cat' in that case; when the computer gets the answer right, the model's choice is reinforced. Over time, the machine learning models get more and more accurate.

Machine learning is great, because machine learning models are far more flexible than human-coded rules-based systems. In the cat example, I imagined coding rules related to ear size, fur pattern or eye position, and those are all fine ways to distinguish between cat and dog. However, they are not the only ways. Machine learning is good at automatically figuring out all of the best ways to distinguish between dog and cat, without having to have a human determine them and code them in advance.”

Monika: “Machine learning is a specific branch of artificial intelligence, which is a broad science that deals with making machines smarter. Specifically, machine learning is the ability to learn from and improve through data representing past experiences.”
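The learning loop Arlen describes can be sketched in a few lines of Python. This is a toy perceptron, one of the simplest learning algorithms, and it works on two made-up numeric features (weight in kilograms and ear length in centimeters) rather than on real photos. Note one simplification: a perceptron only adjusts itself when it guesses wrong, and leaves the model unchanged when it guesses right. All of the numbers are invented for illustration.

```python
def predict(weights, bias, features):
    """Guess 'dog' when the weighted score is positive, otherwise 'cat'."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "dog" if score > 0 else "cat"


def train(examples, epochs=20, learning_rate=0.1):
    """Nudge the weights toward the right answer every time a guess is wrong."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            if predict(weights, bias, features) != label:
                # Push the score up for a missed dog, down for a missed cat.
                direction = 1 if label == "dog" else -1
                weights = [
                    w + learning_rate * direction * x
                    for w, x in zip(weights, features)
                ]
                bias += learning_rate * direction
    return weights, bias


# Hypothetical labeled examples: (weight_kg, ear_length_cm), label.
EXAMPLES = [
    ((4.0, 6.0), "cat"),
    ((3.5, 5.5), "cat"),
    ((5.0, 6.5), "cat"),
    ((20.0, 10.0), "dog"),
    ((30.0, 12.0), "dog"),
    ((12.0, 9.0), "dog"),
]

weights, bias = train(EXAMPLES)
new_animal_guess = predict(weights, bias, (25.0, 11.0))
```

Nothing in the code says what makes a dog a dog. The weights end up wherever the labeled examples push them, which is the point of the cat and dog illustration above.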

 
 

There’s a lot of buzz around data science and machine learning. What are some misconceptions about these fields? 

Andrew: “The term machine learning is thrown around these days wherever tech is involved, and is frequently used without much thought. A common misconception is that machine learning and artificial intelligence are one and the same, but machine learning is just a small part of artificial intelligence.

Machine learning is able to learn to predict a specific piece of data by training on large sets of data. However, it cannot predict the future the way most people think; rather, it provides probabilities of certain events happening. A lot of human interaction and manual intervention is required in the machine-learning process, making it a lot less of an automated task than people think.”
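Andrew's point that a model outputs probabilities, not certainties, can be illustrated with a small Python sketch. The scenario (estimating whether a listener will save a song), the input features, and the weights are all hypothetical. In a real system the weights would be learned from data, not written by hand.

```python
import math


def sigmoid(score):
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def probability_of_save(plays_last_week, follows_artist,
                        weights=(0.3, 1.2), bias=-2.0):
    """Return an estimated probability, not a yes/no prediction.

    The weights and bias are made-up values standing in for a trained model.
    """
    score = (
        weights[0] * plays_last_week
        + weights[1] * (1 if follows_artist else 0)
        + bias
    )
    return sigmoid(score)


casual_listener = probability_of_save(plays_last_week=0, follows_artist=False)
engaged_fan = probability_of_save(plays_last_week=10, follows_artist=True)

# Turning a probability into an action is a human decision, for example:
THRESHOLD = 0.5  # hypothetical cutoff chosen by a person, not by the model
should_promote_to_fan = engaged_fan >= THRESHOLD
```

Choosing the threshold, picking the features, and checking whether the probabilities are trustworthy are all examples of the manual work the answer above mentions.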

Arlen: “I think there are a lot of misconceptions: for example, that machine learning is close to 'general intelligence' (it's really not), or that ‘as long as you have enough data, you can get answers’ (no, not always). But one of the major, and dangerous, misconceptions that I see is the idea that data, and the answers derived from data, are unbiased and neutral. I'm absolutely fascinated by machine learning and data science. I think it's amazing what computers can do and I think there are still a lot of advances waiting to be unlocked. At the same time, I'm deeply worried when I hear that machine learning is being applied to hiring, government processes, credit ratings, fraud prevention, and so on.

The answers that come out of machine learning models are only as good as the data that is fed in. Unfortunately, a lot of the data that is fed in starts out with inherent biases, so the models that get trained on that data will reflect those biases. If you train a language processing model on random internet content, it will reflect all of the biases found in that content. Another example might be an HR department looking to ensure that its candidate-screening process is neutral. Naturally, one would think that if a computer is making the decision between "good candidate" and "bad candidate", that's about as neutral as it gets. That's not the case, though. If the model was trained on biased data, it will likely reflect and reinforce those biases.”
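A tiny Python sketch shows how this happens. The "model" here is deliberately simplistic: it just memorizes historical hire rates. The records are invented, with every candidate given the same skill score and past reviewers favoring one group. It is an illustration of the mechanism under those assumptions, not a real screening system.

```python
from collections import defaultdict

# Invented history: (group, skill_score, was_hired).
# Every candidate has the same skill score, but group "A" was hired more often.
HISTORY = [
    ("A", 7, True), ("A", 7, True), ("A", 7, True), ("A", 7, False),
    ("B", 7, True), ("B", 7, False), ("B", 7, False), ("B", 7, False),
]


def train_hire_rate_model(history):
    """'Learn' the historical hire rate for each (group, skill_score) pair."""
    counts = defaultdict(lambda: [0, 0])  # key -> [hired_count, total_count]
    for group, skill, hired in history:
        counts[(group, skill)][0] += int(hired)
        counts[(group, skill)][1] += 1
    return {key: hired / total for key, (hired, total) in counts.items()}


model = train_hire_rate_model(HISTORY)

# Two candidates with identical skill scores get different "scores" from the
# model, purely because of the group imbalance in the training data.
score_group_a = model[("A", 7)]
score_group_b = model[("B", 7)]
```

The code contains no rule that prefers one group. The preference comes entirely from the data, which is why a computer making the decision is not automatically neutral.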

What are some things machine learning can do now that were once impossible? Anything interesting you’ve come across recently?

Andrew: “I have been blown away by the advancements in machine learning for image analysis, and the creativity applied to this realm. I've been learning a lot about generative adversarial networks (GANs), in which two neural networks face off and one attempts to trick the other. This has been applied in the creation of images and artwork for a variety of purposes, including music album covers. For example, a friend recently released a song, and instead of designing the artwork themselves or hiring an artist, they used GANs to construct the cover. An example of GANs at work can be found on Artbreeder, which uses this kind of neural network to merge images of different objects and animals together.”

 
 

Arlen: “It's been about a year since I first heard about it, but I'm still absolutely blown away by OpenAI's GPT-2 text generation. The quality of text that a computer algorithm is capable of generating is absolutely amazing, and a little bit scary. It's amazing because until very recently, text generation was an area where computers would very quickly (within only a few words) spiral off into meaningless gibberish. For GPT-2 to keep a consistent and readable thread through multiple paragraphs is astonishing.

They also have a variant of this where they apply the same algorithm to music. Again, 'state of the art' AI-generated music, until recently, was music that could hold itself together for a few bars before losing the plot.”

How do you feel the music industry could change as a result of data, data science, and machine learning?

Andrew: “With regard to the recording and performance of music, advancements in data analysis and machine learning have already made an impact on a variety of tasks. Functionality such as tempo modification, song structure analysis, pitch manipulation, and other frequency analysis has made the creation and release of music much more accessible and affordable for individuals.

Data science has also made it easier for artists to promote and market their music specifically to an audience that would be curious about and interested in their releases. This allows artists to reach and connect with their fans in ways that were not possible before. As such, data science has helped in all parts of the music pipeline, from the creation of audio to fans listening to the finished music.”
