Blog posts tagged: stackoverflow
News and other things I find interesting
Last modified: Thursday, April 28, 2011
I thought it would be interesting to calculate the average age of users on each StackExchange site, and even more interesting to see each tag within those sites.
I did a calculation using the April 2011 data dump and came up with the following data.
I call the statistic the Expected age of a tag because it is calculated using the Expected Value.
- The expected age of the whole StackOverflow site is ~30 years old.
- On StackOverflow the tag with the youngest expected age is 26 years old, the tag with the oldest is 36. I was surprised they were so close together.
- The site with the youngest users of the StackExchange network is: Gaming, then surprisingly Game dev, and Ask Ubuntu.
- The site with the oldest users of the StackExchange network is: Do It Yourself, followed by Photography, and then by Geographic Information Systems.
- A funny one: on ServerFault, one of the tags with the oldest expected age is old-hardware. Apparently older people know more about old-hardware than anything else.
- I'm not sure if this is true, but perhaps the tags with younger ages are more cutting edge. For example vb6 and COBOL have ages of over 36 on Programmers SE. I don't think this assertion is true in general though.
And as for the other sites, the expected age is:
- Android: 30.02 years old.
- Apple: 30.50 years old.
- Ask Ubuntu: 28.08 years old.
- Cooking: 33.18 years old.
- Do It Yourself: 35.68 years old.
- Electronics: 32.01 years old.
- English Language and Usage: 32.22 years old.
- Game Development: 27.72 years old.
- Gaming: 27.39 years old.
- Geographic Information Systems: 33.34 years old.
- Mathematics: 30.19 years old.
- Photography: 34.01 years old.
- Programmers: 32.26 years old.
- Server Fault: 31.63 years old.
- Stack Apps: 28.31 years old.
- Stack Overflow: 30.48 years old.
- Statistical Analysis: 33.67 years old.
- Super User: 30.09 years old.
- TeX - LaTeX: 30.86 years old.
- Theoretical Computer Science: 30.58 years old.
- Unix: 29.97 years old.
- Web Applications: 29.85 years old.
- Webmasters: 29.64 years old.
- Wordpress: 30.32 years old.
You can see the per user tag data by clicking on the site name in the above list.
You could probably say that the StackExchange network could use younger contributors. I've said this before, but I think it would be advantageous for the StackExchange team to do some events at Universities. When I previously helped with some Microsoft events at University of Waterloo (Top Computer Science University in Canada, and one of the top in the world) several students didn't know what StackOverflow was.
How I made the calculations per tag
The calculations below use the April 2011 StackOverflow data dump.
What I calculated was, for each StackExchange site, the average age of the users that each tag's answers come from.
To do this I calculated the Expected Age for each tag on each site.
Expected Age = Summation over each age X of: P(X) * X
P(X) is the probability that a user of age X will answer a given question. You can calculate this probability by counting the number of answers given by users of each age and dividing by the total number of answers within that tag.
I also only considered the top 3000 tags. The top tags may not match up exactly since I only consider tags if the answerer has an age specified in their profile.
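As a rough illustration of the Expected Age formula above, here is a minimal Python sketch. It is not the code used for these stats: the function name, the input shape (a list of tag and age pairs, one per answer), and the accepted age range of 13 to 90 are all assumptions made for the example, since the actual cutoffs are not stated.

```python
from collections import Counter, defaultdict


def expected_age_per_tag(answers, min_age=13, max_age=90):
    """Return {tag: expected age} from (tag, answerer_age) pairs.

    `answers` holds one pair per answer; age is None when the user
    left it blank. min_age/max_age are hypothetical cutoffs.
    """
    counts = defaultdict(Counter)  # tag -> {age: number of answers}
    for tag, age in answers:
        # Skip answers with no age or an age outside the accepted range.
        if age is None or not (min_age <= age <= max_age):
            continue
        counts[tag][age] += 1

    result = {}
    for tag, by_age in counts.items():
        total = sum(by_age.values())
        # Expected Age = sum over each age X of P(X) * X,
        # where P(X) = answers from age X / all answers in the tag.
        result[tag] = sum((n / total) * age for age, n in by_age.items())
    return result


sample = [("c#", 30), ("c#", 40), ("c#", 30),
          ("cobol", 50), ("cobol", None), ("c#", 200)]
ages = expected_age_per_tag(sample)
```

Note that a user who answers many questions in a tag contributes once per answer, which is the same counting behavior listed under the limitations below.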
Other attempts at these stats
I initially tried to do this statistic by weighing each age by the reputation of each user, but it turned out to not generate interesting data. The problem was that the data was weighted heavily to only include the top 1% or so of users.
Limitations of this study
- Several users don't enter their age in their profile, so no answers from a user without an age specified counts.
- Users that are very young and users that are very old may be more unlikely to enter their age.
- Each user may be counted more than once, since I count +1 for a user's age each time that user answers a question.
- Some users may be entering fake age values, although I ignored age values out of an acceptable range.
- We are talking about averages here, so this doesn't mean there aren't a lot of younger and older contributors.
For example if an average is 20 years old, there could be an equal amount of 10 and 30 year olds answering, or there could be only 20 year olds answering.
Last modified: Thursday, April 28, 2011
I refreshed my lists of social networking accounts (Twitter, LinkedIn, and Facebook) for StackExchange users. The lists are sorted by reputation and updated for the April 2011 data dump.
The data dumps surface every 2 months, so I will update the lists on my site around the same frequency.
This month 7 new sites appeared since they came out of the StackExchange beta:
- Do It Yourself
- Geographic Information Systems
For the first time there are over 20 StackExchange sites, and so I ran into a problem of Twitter only allowing you to host 20 lists. For each site I use an automatically maintained list of the top 500 users.
I tried to contact Twitter support to raise my limit of 20 lists but they could not help. I ended up getting my 2 sons to host the lists, so I have all automatic lists up and room for another 36 StackExchange sites. Thanks @linkbondy and @ronniebondy.
Last modified: Friday, April 22, 2011
But exactly what portion of the users on the new StackExchange Q&A sites are new, and what portion are shared with StackOverflow?
I mined the November 2010 data dump again and came up with some interesting stats.
To figure out the common percentage between StackOverflow and other sites, I created in-memory lists of the users for each site, and then figured out which users had the same email hash. A user across sites with the same email hash can be considered the same user.
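The matching step can be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name and the sample hashes are made up, and the real analysis may have loaded and compared the data differently.

```python
def common_user_stats(site_hashes, so_hashes):
    """Count how many of a site's users also exist on StackOverflow.

    Both arguments are iterables of email hashes. Sets are used so
    that each user counts once and lookups stay fast.
    """
    so = set(so_hashes)
    site = set(site_hashes)
    common = len(site & so)  # users whose hash appears on both sites
    total = len(site)
    pct = 100.0 * common / total if total else 0.0
    return common, total, pct


# Hypothetical hashes, for illustration only.
so_users = {"a1", "b2", "c3", "d4"}
cooking_users = ["a1", "c3", "z9", "a1"]
common, total, pct = common_user_stats(cooking_users, so_users)
print(f"Cooking: {common} of {total} in common ({pct:.2f}%)")
```

The percentage is taken relative to the smaller site's user count, which matches how the per-site figures below are reported.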
I knew before doing this analysis that the percentage of common users to StackExchange users would be high because of the relative size of the StackOverflow community. I do fully expect for this 73% to decrease for future data dumps though and it will be interesting to re-run these stats and compare when the next data dump comes out.
Here are the statistics per site:
- Cooking: 2630 of 3155 in common (83.36%)
- Game Development: 2497 of 2938 in common (84.99%)
- Gaming: 3813 of 4418 in common (86.31%)
- Mathematics: 2162 of 2965 in common (72.92%)
- Photography: 1659 of 1916 in common (86.59%)
- Server Fault: 28770 of 38434 in common (74.86%)
- StackApps: 3656 of 3874 in common (94.37%)
- Statistical Analysis: 1298 of 1728 in common (75.12%)
- Super User: 31897 of 49157 in common (64.89%)
- Ubuntu: 3245 of 5090 in common (63.75%)
- WebApplications: 5575 of 6223 in common (89.59%)
- WebMasters: 2612 of 2820 in common (92.62%)
Total: 73.19% in common, 26.81% distinct
Of particular interest are the sites with a very high common percentage and some overlapping questions like the WebMasters StackExchange site.
What percentage of SO users come from the other sites? I checked the registration dates and a surprising 5% of SO accounts come from the other sites. This doesn't change the result much above though. Almost all of these 5% of distinct accounts come from Ask Ubuntu, Super User, and Server Fault.
Last modified: Friday, April 22, 2011
Wondering who to follow on twitter to keep up to date on technology?
I mined the latest StackOverflow (SO) data dump for all users with twitter accounts, then calculated each user's top tags based on most votes, and finally sorted the lists by user reputation.
The end result is that you can now easily stay connected with the people in your Stack Exchange community.
Here is a screenshot of what the SO list looks like, containing over 2300 Twitter accounts:
If you'd like to have your account listed in the directories, simply make sure your twitter account is linked somewhere in your profile, and I'll update these lists again on a future data dump.
I also mined the available Stack Exchange data dumps and extracted those twitter accounts as well.
You can view the lists here:
- Game Development
- Web Applications
- Statistical Analysis
- Created real twitter lists which are self updating via the Twitter API. You can access these twitter lists from the lists linked above. Note: Twitter has a limit of 500 users per list so I include only the top 500 users.
- Removed some meta tags for the "Known By" list such as "mistakes" so that I don't show anyone as being known for mistakes :)
- Fixed a bug with non StackOverflow sites linking to the StackOverflow user pages.
- Added better parsing to find twitter URLs
- Added filtering of bad twitter URLs
- Removed invalid twitter accounts that don't actually exist anymore
- Added followers count, following count, last tweet date, and twitter description
Last modified: Friday, November 16, 2012
Update December 11, 2011: StackOverflow recently implemented removing nofollow links on high rated posts. It is very strict, but it's a start.
Update November 12, 2012: I don't know how many answers have nofollow removed, but I think it's a very, very, very small number. I'd bet much less than 0.1%.
For example see this accepted answer with 74 upvotes from a user with almost 100k reputation. The links are to MSDN (which is probably not spam by definition) and to a quoted source on techbubbles.com.
I personally chose to stop answering questions in the same capacity as I used to for the reasons outlined in this post.
Update November 16, 2012: The link mentioned on November 12th was fixed by StackOverflow's Kevin Montrose. I'm not sure if this had a wide effect on less strict nofollow removal, or if it was special cased to remove the nofollow.
Everyone with any exposure to HTML knows what a link element looks like:
<a href="http://wwww.brianbondy.com">My Website</a>
This is a link called My Website with a link target of http://wwww.brianbondy.com.
Links like this can be easily marked up with a rel attribute to add extra information about the link.
One particular usage of the rel attribute is nofollow:
<a rel="nofollow" href="http://wwww.brianbondy.com">My Website</a>
The rel=nofollow attribute and value is used to inform a search engine that the link's target should not gain any ranking benefit from the link.
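If you want to check which links on a page carry nofollow, a small sketch using Python's standard html.parser module could look like this. The class name and the sample markup with example.com URLs are made up for illustration; the only assumption about the HTML is that rel may hold several space-separated tokens in any letter case.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Collect (href, is_nofollow) for every <a> tag with an href."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return
        # rel can hold several tokens, e.g. rel="external nofollow".
        rel_tokens = (attrs.get("rel") or "").lower().split()
        self.links.append((href, "nofollow" in rel_tokens))


html = ('<a href="http://example.com/a">A</a>'
        '<a rel="nofollow" href="http://example.com/b">B</a>'
        '<a rel="external NoFollow" href="http://example.com/c">C</a>')
audit = LinkAudit()
audit.feed(html)
for href, nofollow in audit.links:
    print(href, "nofollow" if nofollow else "followed")
```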
What problem is nofollow supposed to solve?
nofollow was supposed to allow search engines to detect links on a page which could be subject to spam.
A perfect example of where this is useful would be on a blog site where comments are allowed.
The nofollow convention was created because in theory, if you take away the PageRank benefit from a link's target, spammers should feel discouraged from spamming their links on random blogs.
Who came up with the nofollow convention and who follows it?
Members of Google originally came up with nofollow mainly for blogger.com in 2005.
This convention of not affecting page rank of the target was adopted by Google in 2005 and later by Yahoo and Bing as well.
Each search engine has its own interpretation of nofollow but in general they all agree that PageRank should not be attributed to the link's target.
Does nofollow work?
nofollow doesn't solve the problem it was intended to solve.
Spammers still want direct clicks into their site, and they have no guarantee that search engines will actually do like they say and ignore the links, so spamming is still useful to the spammers.
Spammers also know that on many sites the content of the site is duplicated on other domains, and sometimes these duplicated sites do not use the nofollow attribute.
Since its original inception there have also been attempts to repurpose nofollow for paid advertising links. However this affects an entire market of people who pay for links so that they get the benefit of SEO.
Internally to a site, for links inside that site
What is nofollow abuse?
nofollow abuse is when a site uses nofollow not to indicate potential spam, but instead for its own selfish benefit.
In particular, if a site marks a link as nofollow when credit is due to the attributed source, and the site knows the link is not spam, then you have nofollow abuse.
Does abusing nofollow hurt the Internet?
Abusing nofollow means that the sites that should get credit for good content are no longer getting that credit.
This will in turn mean that you will receive search results that aren't the best possible ones.
Why do sites abuse nofollow?
The problem is that many sites want to be the highest rated site on searches from search engines. When the abusing site refers to sources, they will always mark the links to other sources with nofollow.
That way the abusing sites will have a better chance of coming up in searches before the people they are quoting and referring to.
Sites do this for selfish benefit and also because they believe their site has the best content available on the Internet. If it is the best content on the Internet though they shouldn't need to do dirty tricks with nofollow.
Who abuses nofollow?
Many major players do, and many major players do not.
Particularly responsible are those sites with a reputation system in place.
The site in particular that I want to talk about is stackoverflow.com.
Stackoverflow is a site for programming Q&A and is also the same framework used by many other Q&A sites on a variety of other subjects called StackExchange.
Stackoverflow and the entire StackExchange network are some of the biggest abusers of nofollow.
One of the co-founders and lead developers Jeff Atwood has stated:
You get a followed link in the "website" field of your user profile at 2000 reputation. Beyond that, everything outside the network is nofollowed as a simple matter of standard policy. Exactly like, and for all the same reasons as, Wikipedia.
The heart of the abuse though doesn't come from attributing user pages with a bonus link to their website.
The abuse comes into play when questions and answers link to references that their answers are based on. Links highlighted in orange indicate nofollow abuse.
In Jeff's quote above, he doesn't address the fact that Wikipedia and Stackoverflow are very different sites.
Wikipedia organizes its content by topic only. Stackoverflow organizes its content first by topic, and then by author. Each author has a reputation which could be used to determine if their answer is trustworthy. Each answer also gets up-votes which could be used to determine if the answer is trustworthy. Whether a question is still open rather than closed could count as well.
Wikipedia does apply nofollow to all of their external links, but this does not make them right. They are almost as guilty as Stackoverflow.
Stackoverflow is even more guilty because they have a reputation system in place and they know that the users with enough reputation will not spam their site.
Stackoverflow does not want to compete with other sites over Google ranking positions. This is because over 87% of their incoming traffic comes from Google searches as of December 15th, 2010.
More on Stackoverflow nofollow abuse
Jon Skeet, the #1 user on Stackoverflow, has 250k reputation and is immune to many things; however, the links he posts have nofollow.
You can see this on his about page, on his questions, on his answers, and on his comments.
However, within 1 hour of a meta post about a bug with nofollow not being added, the report was marked Status Completed!
Sponsored tags on Stackoverflow have nofollow; this ties in with Google trying to repurpose nofollow for paid links, as mentioned above.
The problem is made worse in that all StackExchange sites behave in the same way. Also, if you reference Stackoverflow from a StackExchange network site they will actually remove the nofollow.
Stackoverflow is a nofollow Hypocrite
On this post entitled Attribution Required, Jeff Atwood explains how the content that their community creates, if used, must be linked without nofollow:
By “directly”, I mean each hyperlink must point directly to our domain, and not use a tinyurl or any other form of obfuscation or redirection. Furthermore, the links must not be nofollowed.
This stance is good, it protects the content of the well deserved writers of Stackoverflow such as myself. I am within the top 50 users and hence have spent a lot of my time writing great answers. But these answers are not 100% of my own creation, they often build upon other people and other answers from other sites. It's simply wrong that these other sites are not attributed page rank when I link to them.
When Stackoverflow builds their answers upon other great articles, they fully abuse nofollow. Even when an answer is a complete copy of another page with a reference link, the link will be nofollow.
However when Stackoverflow benefits from not using nofollow they make sure that you don't abuse nofollow. Stackoverflow will always strip nofollow if the link you post is on the Stackoverflow domain or StackExchange network, but it will not strip it for any other attributed site.
Another place where they link back to their site is via the StackExchange flair which they want people to include on their websites. These links of course do not contain a nofollow.
Stackoverflow has always prided themselves as being less evil than experts-exchange.com. And in many ways they are less evil. But one area where this is not true is that nofollow is not used on all links in experts-exchange.
Do all sites abuse nofollow?
Slashdot is an example of a website which does not abuse nofollow.
It is a site which Stackoverflow should look to for inspiration in this respect.
Slashdot has per user karma and they will selectively remove nofollow from trusted sources.
I verified this for both their comments and their article posts.
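To make the idea of selectively removing nofollow concrete, here is a hypothetical Python sketch of a trust-based link policy. It is not how Slashdot or Stackoverflow actually decide: the function name, the list of own domains, and the reputation and score thresholds are all assumptions picked for illustration.

```python
from urllib.parse import urlparse


def rel_for_link(href, author_reputation, post_score,
                 own_domains=("stackoverflow.com",),
                 min_reputation=2000, min_score=5):
    """Return "nofollow", or None when the link should be followed.

    The thresholds are made-up values; a real site would tune them.
    """
    host = (urlparse(href).hostname or "").lower()
    # Links into the site's own domains are always followed.
    if any(host == d or host.endswith("." + d) for d in own_domains):
        return None
    # Trust external links from reputable authors on well-voted posts.
    if author_reputation >= min_reputation and post_score >= min_score:
        return None
    # Everything else keeps nofollow as a spam deterrent.
    return "nofollow"


print(rel_for_link("http://msdn.microsoft.com/x", 100000, 74))
print(rel_for_link("http://spam.example/x", 1, 0))
```

With a policy along these lines, a highly voted answer from a high-reputation user would pass credit on to its sources, while links from new or low-scoring posts would still be nofollowed.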
How can we solve this problem?
One thing we can do is raise awareness of nofollow abuse. That way the offending sites may eventually get the point of not abusing nofollow.
I would hope that search engines will be powerful enough to not only ignore nofollow from abusing sites, but even punish these sites for trying to abuse the convention.