Hey Neville,

When was the last time someone wrote you a custom HTML letter? ;)

I am excited to show you what we do with the data!

I was intrigued by your challenge, and kept going down and down the rabbit hole. You gave me at least three Key Performance Indicators (tweets, facebook shares, and OkDork comments), and I did at least three different experiments for each KPI. It took somewhat longer than just pressing the "GO" button, because we expanded your data with new metrics.

Here is the data we got:

OkDork 2014 Data Points


Data from OKDork posts published in 2014.

If you don't feel like reading you may jump down to the graphs, but I am going to start with a summary.

How does this analysis help you become a better blogger?

So far we can make the following recommendations based on the data:
  • Make longer, more informative headlines
  • Post at the beginning of the week
  • Images help for twitter, but don't really help for facebook.
  • The mere presence of a video will not make your post go viral.
  • Numbers in a headline do not guarantee popularity - this metric was never selected as important.
  • If the headline is short - keep your story short as well (~1200 words max). Longer headlines go best with a longer story - you may go wild here.
  • Be emotional. Super-neutral posts do not get many shares (with two exceptions)
  • Make your story visual. Using the word "see" helps.
  • When making a book review, don't hope to get many shares. (it's a pity!)
  • Noah did best (in terms of shares on facebook and twitter) when his posts belonged to these word segments: (1) content, post, blog, headline, page, linkedin, share, social, article, kissmetrics and (2) email, client, job, people, hiring, work, noah, designer, freelancer, person. Maybe you will do too...

Conclusions:

0. We want more data

I think we have an interesting story here, but I would love to repeat the analysis for a much larger set of posts. The textual analysis will get crisper.

1. Playing with your raw data did not give earth-shattering results at first

The data you gave me did not have sufficient information about the post popularity - we could create predictive models, but they all had a poor accuracy (around 60% correlation with the real shares, which is weak).
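For reference, the "correlation" quoted throughout can be read as the Pearson correlation between predicted and actual share counts. That reading is an assumption on my side, and the numbers in this small sketch are made up for illustration; they are not OkDork data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values, not taken from the OkDork table.
actual_shares = [12, 40, 95, 310, 1500]
predicted_shares = [20, 35, 120, 250, 600]
print(round(pearson(actual_shares, predicted_shares), 2))
```

A value near 1 means the predictions rise and fall with the real shares; it says nothing about whether the predicted magnitudes are right, which is why a model can correlate well and still under-predict the biggest posts.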

2. Adding the weekday to the list of metrics helped a lot!

We first added to your data a few additional metrics on the post stats:

  1. Numbers in a Headline: Whether or not the headline has any numbers in it.
  2. Number of words in a headline.
  3. Week day number: The day of the week when the post was published, from 1 being Monday to 7 being Sunday.
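A minimal sketch of how these three metrics could be derived from a headline and a publication date. The function name and the example post are hypothetical, and the digit check only catches numerals ("7"), not spelled-out numbers ("seven").

```python
import re
from datetime import date

def headline_features(headline, published):
    """Derive the three extra metrics for one post."""
    return {
        # True if the headline contains at least one digit.
        "has_number": bool(re.search(r"\d", headline)),
        # Whitespace-separated word count of the headline.
        "headline_words": len(headline.split()),
        # isoweekday() returns 1 for Monday through 7 for Sunday.
        "weekday": published.isoweekday(),
    }

# Hypothetical post, for illustration only.
print(headline_features("7 Ways to Grow Your Email List", date(2014, 1, 6)))
```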

Then we ran the experiments to predict the number of facebook and twitter shares allowing the following metrics: numbers in a headline, headline length, word count in the post, # images, whether or not there is a video, week day, and words in a headline.

The weekday really rocked it, because we got models with 83-85% correlation (you can play with the posts below). All was cool, but the models were not capturing the viral posts (with a gigantic number of shares). You will see it in the graphs. For both facebook and twitter shares, the models predict most posts reasonably well but fail to see the 'viral' posts.

The reason for this is likely that the actual content of the posts was related to their epic popularity (I know - no rocket science). So, we decided to dig a little deeper.

3. Adding sentiment information, and segmenting posts into distinct groups helped further.

To dig deeper, we analyzed the content of the posts.

These metrics were added to your data in the next iteration:

  1. We've done the sentiment analysis of the content of each post and ranked them into POSITIVE, NEGATIVE, and NEUTRAL.
  2. Also added the metric NEUTRALITY - whether or not the post is neutral.
  3. Added the number of external links in the post. (IDEALLY, if you have access to more posts we would add the information on whether or not the post is referencing some other external epic posts.) This number of links turned out to be important for facebook shares only, and only to a small extent.
  4. We also did a super-simple analysis of text (natural language processing), and identified six distinct clusters in the posts. They are pretty interesting, because they have very different average shares for each metric. You can find this table below.
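Two of these metrics are easy to sketch in code: the NEUTRALITY flag and the external link count. The snippet below is only an illustration under assumptions: the sentiment label is taken as already computed (the sentiment tool itself is not shown), `okdork.com` is assumed to be the blog's own domain, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def count_external_links(html, own_domain="okdork.com"):
    """Count links whose host is neither own_domain nor one of its subdomains."""
    parser = LinkCollector()
    parser.feed(html)
    count = 0
    for href in parser.hrefs:
        host = urlparse(href).netloc.lower()
        # Relative links have an empty host and are treated as internal.
        if host and host != own_domain and not host.endswith("." + own_domain):
            count += 1
    return count

def neutrality(sentiment_label):
    """Return 1 if the post was labeled NEUTRAL, otherwise 0."""
    return 1 if sentiment_label.upper() == "NEUTRAL" else 0
```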

All these extra metrics were added to the table and fed to DataStories again. The results were MUCH better. We could more accurately predict the popularity of all posts, but the 'epic' posts were still under-predicted.

It helps to NOT BE NEUTRAL. Only two highly popular posts are neutral: A (Proven) Freelancer’s Guide to Growing Your Business and Are things happening to you or are you making things happen? And these two posts were still hard to predict accurately.

4. Adding the frequencies for the top 22 words did it - we could almost perfectly predict the popularity of all posts.

The only thing I don't like is that the usage of the word "two" was labeled as important (the fewer times you use it, the better for shares). This is an artefact of the small sample of posts that we have. I will remodel everything excluding the words "one" and "two", but thought I would show you this for now. If we repeat the process on a larger set, the word selection will be crisper.

Machine learning rules!

FACEBOOK SHARES can be predicted with 83% correlation accuracy using only 3 post statistics

Note that five 'epic' posts cannot be predicted well. HOVER over the graph for more info:

TWITTER SHARES can be predicted with 85% correlation accuracy using 4 post statistics, except for six 'epic' posts. HOVER over the graph for more info:

Longer HEADLINE and early WEEKDAY always help, but WORD COUNT is special

The rules for a higher number of Twitter shares are quite simple:

  • A longer headline helps a lot.
  • Publishing posts early in the week helps a lot.
  • More images help a little.
  • The total word count of the best-shared posts really depends on the headline: posts with a shorter headline should be shorter (have a smaller word count) to be shareable. Posts with long headlines will be shared better if they are longer.

For posts with a short headline of 20 characters the optimal word count is no more than 1200 words:

Predicted Tweets. The vertical axis shows five categories of the predicted number of tweets: [1-10, 11-100, 101-1000, 1,001-10,000].

The optimal word count in a post with a short headline should be no more than 1200 words

For posts with a long headline of 70 characters a bigger word count helps:


The optimal word count in a post with a long headline should also be big.

The WEEK DAY being so important is strange but true:

The actual number of posts per week day in 2014 was the following: 7 posts on Mondays, 12 on Tuesdays, 10 on Wednesdays, 1 on a Thursday, 2 on Fridays, 6 on Saturdays, and 3 on Sundays.

It looks like Noah's favorite days to post were Tuesday and Wednesday, but we see that all posts with more than 500 twitter shares were published on Monday and Tuesday (with one exception).



Another great way of looking at the data is sorting the rows of the table by the factor of interest.

Look at the plots of data columns sorted by Twitter shares:


Good visual inspection of the data with the right tools always helps before doing any predictive analytics

BOOK REVIEWS do not get many shares

OkDork published five posts with a book review in 2014. They were much shorter on average (740 words vs. an average of 2400 words for other posts). All of them had precisely one image. The average headline length was very similar to that of other posts: 47 characters. However, the book reviews got considerably fewer shares and comments.

Let's hope Noah will keep doing them despite the low share volume, because they are great!

HOVER over the graph to see an AVERAGE metric for BOOK REVIEW posts and other posts:

Segmenting post content by textual analysis sheds light on shares and viral posts!

We pulled in the text from all post pages, and applied some of the simplest natural language processing algorithms to it.

We stripped each page down to the content of the post, removed all stop words like "is", "the", "a", "are", etc., and converted each word to its infinitive form or nominative case. This gave us 5600 unique words. Then we selected the 3000 most frequent ones and first tried to understand which distinct topics represent the overall content of OkDork posts in 2014.
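The stop-word removal and vocabulary selection can be sketched roughly as follows. This is a simplified illustration, not the exact pipeline: the stop-word list is a tiny stand-in, the function names are hypothetical, and the conversion to infinitive or nominative forms (lemmatization) is left out because it needs a language-specific dictionary.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; the list actually used is not shown here.
STOP_WORDS = {"is", "the", "a", "are", "and", "to", "of", "in"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def top_vocabulary(posts, size):
    """Return the `size` most frequent words across all posts."""
    counts = Counter()
    for text in posts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(size)]

# Hypothetical mini-corpus; with the real posts `size` would be 3000.
posts = ["The taco is a taco", "Email the client and email the designer"]
print(top_vocabulary(posts, 2))
```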

You may consider the topics as the shortest possible summary of all posts of 2014.

Here are the topics sorted by importance:

  1. content post blog page headline share linkedin social article link
  2. email marketing subscriber get want new course summer noah okdork
  3. ad retargeting campaign advertising step targeting facebook target roi click
  4. business thing people work get time make taco yogurt baby
  5. book review amazon people anxiety kindle promotion feedback smartcuts copy

The topics are great, but they do not give us a clear way to segment the posts. So, we took all posts and the frequencies of all unique words in them and clustered the posts into six groups.

We named the resulting segments A, B, C, D, E, F and calculated the average stats for them. The results are interesting!
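To give a feel for what clustering posts by word frequencies means, here is a bare-bones k-means sketch. K-means is used purely as a stand-in, since the clustering algorithm behind the segments is not named here, and the two-word frequency vectors in the example are invented.

```python
def kmeans(vectors, k, iterations=20):
    """Plain k-means on equal-length numeric vectors; returns one label per vector."""
    if k < 1 or k > len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    # Deterministic start: use the first k vectors as initial centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid
        # (squared Euclidean distance).
        for i, v in enumerate(vectors):
            distances = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical counts of the words ("email", "taco") in four posts.
word_counts = [[5, 0], [0, 5], [4, 1], [1, 4]]
print(kmeans(word_counts, k=2))
```

With real data each vector would hold the frequencies of all selected words for one post, and the result depends on the starting centroids, so a production run would normally try several initializations.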

In the table below we show the average metric values for each segment:


Post segments really capture super-popular posts!

Here are the top 10 words representing each segment:

Name  Tweets  Top 10 Words
A     30      taco, shirt, minute, learned, favorite, lesson, started, get, poster, didnt
B     103     email, marketing, get, new, subscriber, okdork, want, post, reader, business
C     117     ad, car, retargeting, campaign, advertising, step, targeting, facebook, target, video
D     361     book, people, thing, business, time, yogurt, sugar, fat, review, take
E     1441    content, post, blog, headline, page, linkedin, share, social, article, kissmetrics
F     1554    email, client, job, people, hiring, work, noah, designer, freelancer, person

What's the conclusion?

It looks like the OkDork followers love everything Noah publishes equally. The top segments for them in terms of the number of comments are lessons & tacos (A), content & headlines (E), and hiring & freelancers (F).

Social media, however, really focuses only on the last two segments, E & F!

Guess what? All seven posts with more than 1500 tweets fall into the last two segments!

With sentiment and word cluster information FACEBOOK SHARES can be predicted with 93% correlation accuracy using only 5 metrics

Note that we predict 'epic' posts better. HOVER over the graph for more info:

With sentiment and word cluster information TWITTER SHARES can be predicted with 92% correlation accuracy using only 6 metrics

Note that the predictions for 'epic' posts have improved. HOVER over the graph for more info:

With sentiment, word clusters and core word counts FACEBOOK SHARES can be predicted with 99% correlation accuracy (!) using only five post statistics

Note that all posts, including 'epic' ones, are predicted well. HOVER over the graph for more info:

With Sentiment and word cluster information TWITTER SHARES can be predicted with 99.9% correlation accuracy using only five metrics

Note that the predictions for 'epic' posts are great! HOVER over the graph for more info: