TLDR
- transfer learning will be the next driver of ML commercial success (domain adaptation, fortuitous data, ...)
- use data sources such as people's typing behavior to label data (e.g. finding word boundaries from pauses): so-called fortuitous data
- problem with labeled data: there is too little of it, and what exists is biased
- collaboration between data scientists and software engineers improves with better mutual understanding and better tools and frameworks
- students are super important to deliver great stuff
- a good portion of gut feeling and trial & error is needed to cross the finish line
- different kinds of similarity exist => choose your algorithm accordingly
---
From Friday, the 30th of June, to the 2nd of July our team attended the PyData conference in Berlin. As the name suggests, it is about data science with a focus on Python as the data science language, which - going by my impression - is becoming much more popular and powerful than R.
Quite a lot of the - really high quality - talks were about analyzing texts. In this context, it was once again stressed that the lack of labeled data is a huge problem. Even worse, the labeled data that does exist is biased: it is mostly created in the news space (and much more rarely in the social media space, for example) and by middle-aged, white men. This can lead to algorithms learning social stereotypes or developing racist behavior; just remember the Microsoft bot that turned racist, or facial-recognition software having trouble recognizing black faces [
https://goo.gl/1xnKVT,
https://goo.gl/VyHgG8].
“Transfer learning will be -- after supervised learning -- the next driver of ML commercial success.”
- Andrew Ng
To overcome the obstacle of having too little data in a certain field, different approaches (or at least ideas for them) exist, summarized under the term transfer learning.
One interesting idea is to use so-called fortuitous data, i.e. data that falls out of another activity as a by-product, for example data collected by tracking the typing behavior of users: by analyzing where they make pauses, you can try to improve your word chunker (the idea being that people tend to pause between words that do not necessarily belong together). A minimal sketch of this idea follows below.
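To make the idea concrete, here is a minimal sketch (with a purely hypothetical data format and pause threshold, not anything presented at the conference) of how pause lengths between typed words could be turned into weak chunk-boundary labels:

```python
# Sketch: derive chunk-boundary pseudo-labels from keystroke timing,
# in the spirit of "fortuitous data". A long pause between two words is
# treated as a weak signal for a chunk boundary that a word chunker
# could later be trained on. Data format and threshold are assumptions.

PAUSE_THRESHOLD = 0.8  # seconds; made-up cut-off, would need tuning

def pseudo_label_boundaries(typed_words, timestamps, threshold=PAUSE_THRESHOLD):
    """typed_words: words in typing order; timestamps: time (in seconds)
    at which each word was finished being typed."""
    labels = []
    for i in range(len(typed_words) - 1):
        pause = timestamps[i + 1] - timestamps[i]
        # 1 = likely chunk boundary after word i, 0 = words belong together
        labels.append(1 if pause > threshold else 0)
    return labels

words = ["book", "a", "table", "for", "two", "tomorrow"]
times = [0.0, 0.4, 0.7, 2.1, 2.4, 2.8]
print(pseudo_label_boundaries(words, times))  # -> [0, 0, 1, 0, 0]
```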
Transfer learning is key to building better conversational AIs which are able to utilize different language models, understand mixed-language queries and generalize dialogues beyond their special domain.
In the end, techniques and algorithms to learn from and handle small data will be crucial to improve systems. Although rules can help to get started and probably to refine your systems later on, it was stated that the time of creating flow charts is over: data-driven is the future.
Some frameworks were presented, like H2O, and one takeaway definitely was that you should not try to reinvent the wheel. The space of ML frameworks and libraries is already crowded. But it also became clear that such frameworks are quite inflexible and not easily extendable. The strategy here is to learn from the frameworks and utilize what is best, but also to build up your own research landscape in parallel so that you can react and experiment fast.
When building up such a landscape, it is important to see and understand the different types of people coming together, mainly data scientists and software engineers. The former want to experiment fast, crunch out results and insights, and then move on. The latter want to build a sustainable system utilizing the research results. Tools like Jupyter and the growing ecosystem around it will definitely make collaboration in that field easier in the future, and clean code is starting to become a topic in the data science community as well.
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.” - Martin Fowler
In one talk, it was stated that most contributions to Gensim - a popular machine learning library containing text learning algorithms such as word2vec and doc2vec - are made by students, stressing how important it is - in general as well as for companies - to work closely together with universities and the bright minds of future generations!
Besides this extraordinarily important point, the talk went on to explain different kinds of similarity (coast and shore are similar because they are interchangeable, whereas clothes and closet are similar in the sense of being related, but not interchangeable) and which algorithms fit those different purposes best; a small Gensim example follows below.
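As a small illustration (the pre-trained model name is just an example from gensim-data, not something mentioned in the talk), here is how such similarities can be queried with Gensim; distributional embeddings tend to score both interchangeable and merely related pairs highly, which is exactly why the notion of similarity you care about should drive the choice of algorithm:

```python
# Small sketch using a pre-trained embedding model via Gensim's downloader
# (the model name is just an example; any gensim-data word-vector model works).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # downloads the vectors on first use

# Interchangeable pair vs. merely related pair
print(wv.similarity("coast", "shore"))
print(wv.similarity("clothes", "closet"))
print(wv.most_similar("coast", topn=3))
```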
This leads to another important point: trial & error and relying on gut feeling. Some quite successful Kaggle competitors presented their work in different competitions where they ranked high and explained what kinds of neural nets, algorithms and approaches they used. Besides a deep understanding of the field, it became clear - and they openly admitted it - that a lot of trying-out phases were involved without really knowing whether or not something would work. This is because the algorithms heavily depend on the kind of data, how much data exists, how much of it is labeled, etc. So it is always hard to tell beforehand what will work - sometimes it's kind of magic ... and quite difficult to explain to others. Therefore, start with best practices and insights from others, which will often lead quite quickly to already (very) satisfying results, and then try to improve on them by trying out different approaches and variations.
Of course, sometimes it makes sense to ignore the best practices from one domain and instead adapt those from another one (which brings us somehow back to the beginning of the blog ;)). For example, RNNs are usually used for text analysis, but CNNs, which come from the field of image recognition, can be applied to text with quite good results and much faster: CNNs can be parallelized in contrast to RNNs, which in one case reduced the training time from 2 days to 1 hour (of course, this also depends on your setup, hardware etc.). A minimal sketch of such a text CNN follows below.
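For the curious, here is a minimal sketch of a 1D-CNN text classifier in Keras; vocabulary size, sequence length, number of classes and the dummy data are made up for illustration, and this is not the architecture from the talk:

```python
# Sketch of a CNN for text classification (assumed sizes and dummy data).
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 100, 2

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),               # word-id -> dense vector
    layers.Conv1D(64, kernel_size=5, activation="relu"),  # n-gram-like filters
    layers.GlobalMaxPooling1D(),                      # strongest filter response
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data just to show the expected input: padded sequences of word ids.
x = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, NUM_CLASSES, size=(32,))
model.fit(x, y, epochs=1, batch_size=8)
```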
In one talk, someone tried to connect blockchain with AI. In short, he explained how you can exchange machine learning models via blockchain technology. Although this was a rather clumsy attempt to connect the two buzzwords, the overall conference was of high quality and well worth its time. And also on a lower, more detailed level than depicted here, we took things home with us.
Just a few days ago, a university created a browser extension and asked for volunteers to install it and anonymously share their Google search results (German article:
http://www.spiegel.de/netzwelt/web/datenspende-forschungsprojekt-ueber-google-sucht-freiwillige-helf...). They want to compare the results across a lot of people and find out how your personal information affects what the search engine returns to you. Maybe such an approach could be used to generate labeled data within the company: in the sense of fortuitous data, anonymously tracking typing behavior etc. - perhaps something to think about!
With
❤️ from Berlin - Your Enterprise Mobility SAPAI Team!