DG2: Birth month of English & Japanese footballers


This post was prompted by Christopher Kim's blog on birthdays and MLB players:

Can Your Birthday Help You Play Major League Baseball?

I thought I would look into the birthdays of English and Japanese footballers and how they relate to the school year in each country. At the moment my family is living in England, and my children follow the school year from September to mid July. My children also attend a Japanese school every Saturday and follow the Japanese academic year, which starts in April. I therefore wanted to see whether I could get data for both countries to test the theory that a disproportionate number of English and Japanese born footballers are born in the first few months of the school year.

Data Collection

The first issue was where to get the data for the respective countries. Previously I had followed this tweet announcing that every 'on the ball' event in the English Premier League would be made available (that would be big data), including a reduced set of data for free. I did apply but did not hear anything back, and that offer has now gone. So I had to look around for my dataset, and my choice was Wikipedia, because its data can be queried with SPARQL through online forms such as this one from DBpedia. More information about DBpedia can be found at this link

SPARQL queries

I have used SPARQL previously to query Wikipedia data that is available in DBpedia while I was living in Japan, although I have not used it for a few years, so now seemed a good opportunity to try it again.

English Football Players Query

My chosen Wikipedia category was
The assumption is that all the footballers listed in this category are English and may or may not have attended an English school. Some English players born outside England have been left in the list, as they may have attended an English school. What matters for my purpose is that they are English and that I can select their birthdays.

I split the SPARQL query into two because the player's date of birth was available in two different properties.

PREFIX dcterms: <>
PREFIX skos: <>
PREFIX geo: <>
PREFIX dbpedia2: <>
PREFIX dbpedia-owl: <>
SELECT ?player ?birthDate ?countryofbirth
WHERE {
?player dcterms:subject <>.
?player dbpprop:birthDate ?birthDate.
OPTIONAL { ?player dbpprop:countryofbirth ?countryofbirth }
FILTER (?birthDate  >= "19000101"^^xsd:date )
}

The second query differs only in the property queried for the birth date: dbpprop:dateofbirth instead of dbpprop:birthDate.

PREFIX dcterms: <>
PREFIX skos: <>
PREFIX geo: <>
PREFIX dbpedia2: <>
PREFIX dbpedia-owl: <>
SELECT ?player ?birthDate ?countryofbirth
WHERE {
?player dcterms:subject <>.
?player dbpprop:dateofbirth ?birthDate.
OPTIONAL { ?player dbpprop:countryofbirth ?countryofbirth }
FILTER (?birthDate  >= "19000101"^^xsd:date )
}

If the queries above are cut and pasted into the DBpedia endpoint here and executed, you should get the raw data I used.

DBpedia's form allows you to download the data in CSV format. I ran both queries and downloaded the CSV files from DBpedia and saved them locally.
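The download step could also be scripted rather than done through the form. The following is only a sketch under assumptions: it assumes the public endpoint at https://dbpedia.org/sparql and that it accepts a `format=text/csv` request parameter (as Virtuoso-based endpoints generally do). The function names `build_csv_url` and `download_csv` are made up for illustration and are not part of any DBpedia tooling.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public DBpedia SPARQL endpoint; adjust if you use a different one.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_csv_url(query: str, endpoint: str = DBPEDIA_ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for the query results as CSV."""
    params = {"query": query, "format": "text/csv"}
    return endpoint + "?" + urlencode(params)


def download_csv(query: str, path: str) -> None:
    """Run the query against the endpoint and save the CSV response locally."""
    with urlopen(build_csv_url(query)) as response, open(path, "wb") as out:
        out.write(response.read())
```

Calling `download_csv(query_text, "english_birthdate.csv")` once per query would give the same kind of local files as the manual download; the file name here is just an example.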

Japanese Football Players Query

The same queries were used, but with the category changed as per the line below.

The assumption is that the footballers listed are Japanese (again, they may or may not have attended a Japanese school).

?player dcterms:subject <>.

Using the data in Lumira

The first objective was to load both files into Lumira. As my files had the same format and headings, I used the Add option.

Then I used the Union Feature to merge the CSV files.
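For readers without Lumira, a union of CSV files with identical headings can be approximated in a few lines of Python. This is a sketch, not a description of what Lumira does internally: the function name `union_csv` and the sample rows are invented for illustration, and it simply appends rows without removing duplicates.

```python
import csv
import io


def union_csv(*sources):
    """Append the rows of several CSV sources that share the same header."""
    header = None
    rows = []
    for source in sources:
        reader = csv.reader(source)
        file_header = next(reader, None)
        if file_header is None:
            continue  # empty input, nothing to add
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("CSV headers differ; cannot union")
        rows.extend(reader)
    return header, rows


# Hypothetical sample rows standing in for the two downloaded files.
first = io.StringIO("player,birthDate,countryofbirth\nA,1980-09-01,England\n")
second = io.StringIO("player,birthDate,countryofbirth\nB,1975-07-15,\n")
header, rows = union_csv(first, second)
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.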

With both files in Lumira, I filtered out the "unknown" values and other dates not in a YYYY-MM-DD format.

I then converted the birthDate column to a Date in Lumira, using the format yyyy-MM-dd.

In the new column I noticed I had left in some blank data, so I filtered this out with the Exclude empty values option selected on the filter.
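The three clean-up steps above (drop values that are not YYYY-MM-DD, convert to a date, drop blanks) can be sketched as a single Python helper. The name `parse_birth_date` and the sample values are assumptions made for this example; the real DBpedia output may contain other malformed values.

```python
import re
from datetime import date, datetime

# Exactly four digits, two digits, two digits, separated by hyphens.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_birth_date(value):
    """Return a date for a YYYY-MM-DD string, or None if blank or invalid."""
    text = (value or "").strip()
    if not ISO_DATE.fullmatch(text):
        return None
    try:
        return datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return None  # right shape but not a real calendar date


# Made-up raw values covering the cases mentioned in the text.
raw = ["1980-09-01", "unknown", "", "1975", "1980-13-40", " 1992-04-02 "]
clean = [d for d in map(parse_birth_date, raw) if d is not None]
```

Returning `None` instead of raising keeps the filter a one-line list comprehension, which mirrors the "exclude empty values" option.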

As I was only interested in the month of birth I added a new column using Lumira's data manipulation feature.

Then I added a measure to the newly created month column as a count (all).

I renamed the new measure to Month; the next step was to visualise the data.
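The month column and its count measure amount to counting birth dates per calendar month. A minimal Python sketch, with `births_per_month` and the sample dates invented for illustration:

```python
from collections import Counter
from datetime import date


def births_per_month(dates):
    """Count how many dates fall in each calendar month (1 to 12)."""
    counts = Counter(d.month for d in dates)
    # Include months with no births so that every month appears in a chart.
    return {month: counts.get(month, 0) for month in range(1, 13)}


# Made-up dates, not taken from the DBpedia results.
sample = [date(1980, 9, 1), date(1981, 9, 20), date(1979, 7, 4)]
per_month = births_per_month(sample)
```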

English Footballers Chart

For the theory to hold, the first few months of the English school year (September/9, October/10 & November/11) should have the highest number of players born in them.
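One way to put a number on this check is the share of all births that fall in the first three months of the school year; if births were spread evenly over the months, that share would be about a quarter. The sketch below uses made-up counts, not the figures from my charts, and the helper names `school_year_months` and `first_quarter_share` are hypothetical.

```python
def school_year_months(start_month):
    """List the 12 calendar months in school-year order from start_month."""
    return [(start_month - 1 + i) % 12 + 1 for i in range(12)]


def first_quarter_share(per_month, start_month):
    """Fraction of all births in the first three school-year months."""
    total = sum(per_month.values())
    if total == 0:
        return 0.0
    first_three = school_year_months(start_month)[:3]
    return sum(per_month.get(m, 0) for m in first_three) / total


# Made-up counts: 10 births in every month except September, which has 20.
counts = {m: 10 for m in range(1, 13)}
counts[9] = 20
english_share = first_quarter_share(counts, start_month=9)
```

Passing `start_month=4` applies the same check to a school year that starts in April.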

So the theory holds :smile: for the sample data I collected.

It is interesting that September (start of the school year) has double the number of players of July (end of the school year).

As the data contains footballers born from the 1900s to the 1990s, I thought I would add a filter to show players born after 1975.
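Outside Lumira, the same filter is a one-line comparison on the year of birth. In this sketch "after 1975" is assumed to mean born in 1976 or later; the function name `born_after` and the sample dates are invented for illustration.

```python
from datetime import date


def born_after(dates, year):
    """Keep only dates whose year is strictly later than the given year."""
    return [d for d in dates if d.year > year]


# Made-up dates, not taken from the DBpedia results.
players = [date(1950, 9, 3), date(1975, 12, 31), date(1976, 1, 1), date(1990, 7, 8)]
recent = born_after(players, 1975)
```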

The visualisation still highlights the difference between the start and end of the school year.

Japanese Footballers Chart

Less filtering was required on the Japanese footballer data, but exactly the same Lumira functions were used as for the English players above.

For the theory to hold, the first few months of the Japanese school year (April/4, May/5 & June/6) should have the highest number of players born in them.

So the theory holds again :smile: , with double the number of players born in April, the start of the school year, compared with the end of the school year.

Data Quality

I have used the raw data from the DBpedia queries above, only tidying up blank cells and removing rows whose date of birth was not in the correct format. The data may not be complete for any particular football season or period, so it is just a sample of the football players who happen to be available through the DBpedia SPARQL query.

One last thing to do

From my previous DG2 blog, the one and only day and time for me to publish a blog is Wednesday 13:00.

Update 2/7/2015

Women's World Cup Semi Final - England Vs Japan

The semi final between England and Japan generated some debate in my house.

So, as a way to divert attention, I mention birthdays and footballers again. It's not a wide-ranging data set, but the birthdays of the women footballers do not fit the theory.
