This sterling silver and turquoise bracelet was made by David Bowman and is for sale in his Etsy shop, Chain of Beauty.

Photo courtesy of David Bowman, Chain of Beauty.

We’ve been doing a lot of reading about Etsy’s new Star Seller program, which rolled out at the start of September. There have been many impassioned responses from sellers, and one stood out for us. It was originally posted on the Etsy forums and authored by David Bowman, a chain jewelry artist who also has extensive experience in communications and is an instructor at the University of New Mexico. He goes through each aspect of the rollout and examines the pros and cons in a methodical way that we appreciate.

We reached out to David and he’s granted us permission to republish his forum post. Here’s David:

I understand what Etsy is trying to do with this program. I have conducted research and evaluations of many types of programs for more than 25 years, and it’s clear to me why this rating system is designed as it is.

Let’s say you want to rate customer service. What would you analyze? Message response, shipping times, and buyer feedback are three good categories of information to look at.

Once you decide what to analyze, the next step is to determine what data are available and what data can be generated to analyze. Basically, how would you know if message response, shipping times, and feedback are top quality? What data would you analyze? What are the possible data sources and what can you learn from those data? Then you have to determine whether or not the analysis process is valid, meaning it actually determines what you are trying to learn.

This is where the Etsy star rating system runs into problems. It is not a valid process and does not determine whether or not the seller has good customer service.

It is clear to me that in designing this system, Etsy did not employ statisticians or research scientists. The Star Sellers program would be laughed out of any research journal or nearly any credible trade publication. Try to publish a dissertation using this methodology, and you will not receive your degree. In the program world, a proposal that espoused these practices would never get funded.

Why do I say these things about the Star Sellers system? Let’s look at the specifics, and I’ll explain.

Shipping Time

You can’t actually see the seller send the products, so what can you look at? One possible data point is whether or not there is a tracking number. This is not the same as delivery, but if there is a tracking number, then the product will likely be shipped because delivery costs have been paid. In this way, tracking numbers make a good indicator of likely delivery for a particular product.

On the other hand, not all delivery methods will generate a tracking number and not all products are worth tracking. So, while this is a good indicator that any particular product will be delivered, you cannot use the absence of a tracking number to indicate that the product will not be shipped or shipped on time. You can infer that a tracking number indicates likely delivery, but you cannot infer anything from a lack of a tracking number. It is not a valid process because it cannot be used to evaluate a seller’s customer service level.

This is the flaw in Etsy’s logic: it uses the lack of a tracking number to indicate late delivery. The data cannot be used in this way.

Etsy is making assumptions that are not valid because the lack of a tracking number is not an indication of a customer service level. There is only one thing you can conclude from the lack of a tracking number: the seller didn’t enter a tracking number in the Etsy system. Nothing else can be concluded.

Message response

Replying to buyers’ messages about a product will demonstrate the seller’s responsiveness to buyers. It’s a good metric.

However, for this metric to be used, you would have to separate messages that merit a response from those that do not. If you cannot do that, then you cannot use this metric. By applying it to all messages without differentiating messages needing a response from those that don’t, the measure is not valid.

Along with this, Etsy’s description of the measure specifically indicates they assess how sellers respond to buyers’ messages. If Etsy cannot determine which messages are from buyers, and further determine which messages merit a response, the findings on responses are not valid.

The analysis does not consider whether or not the seller responded to a buyer’s message, only that the seller did not respond to a message, whether or not it was from a buyer and whether or not it needed a response. Basically, the process used for analysis doesn’t measure what is intended, which is the definition of an invalid measure.

It is also not reliable because it will not indicate whether or not the seller responded to a buyer’s question that needed a response, and it is not valid because it does not actually measure a seller’s service to buyers.

There is one additional threat to reliability: quantity. Unless there are enough messages, you cannot make any general conclusions from the message responses. Let’s say a seller only has 3 messages. Each message has an excessive amount of weight on the final result and can skew the results wildly. If one of those 3 messages was spam or contained a suspicious link and the seller simply deleted it, the entire rating can drop below the required level. On the other hand, if you have 300 messages, one anomaly in responding doesn’t have much effect.
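The effect of quantity can be sketched with a little arithmetic. The snippet below is only an illustration: the function name `response_rate` and the 95% cutoff are assumptions made for the example (the 95% figure echoes the threshold used as an example later in this post), not Etsy's actual calculation.

```python
def response_rate(answered: int, total: int) -> float:
    """Percentage of messages that received a reply."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * answered / total


THRESHOLD = 95.0  # assumed cutoff, for illustration only

# One unanswered message (say, deleted spam) in each shop.
small_shop = response_rate(answered=2, total=3)
large_shop = response_rate(answered=299, total=300)

print(f"3 messages:   {small_shop:.1f}% -> meets cutoff: {small_shop >= THRESHOLD}")
print(f"300 messages: {large_shop:.1f}% -> meets cutoff: {large_shop >= THRESHOLD}")
```

The same single missed message costs the small shop about a third of its score and the large shop about a third of one percentage point.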

Five star ratings

This is actually a pretty good indicator of customer service because the data come from the customers based on their impressions of the customer service, among other factors, such as product quality and utility. On the surface, it looks like a valid measure. It is reasonable to assume that five stars are given by satisfied customers and that one star is given by unhappy customers. (In some rare cases where the buyer gets confused, unhappy buyers may give a high number of stars and vice versa.)

But there is a problem: reliability. Buyers use the rating stars inconsistently because they don’t use the same definition for what a certain number of stars means. Basically, a four or five star review doesn’t mean the same thing to each buyer. One buyer might think that 5 stars is only for exceptional, better than expected product and service, but that 4 stars is for simply acceptable service and product. Another buyer might think 3 stars is for barely acceptable, 4 stars is amazing, and 5 stars is only for faster than expected delivery, super low cost, and freebies included in the package. In some cases, one person’s 4-star rating may express greater satisfaction than another person’s 5-star rating. Without a consistent definition for a number of stars, the process is not reliable.

The star rating system is also not a valid measure of customer service because the stars are not only for customer service but for other factors as well.

If you take a look at associated comments, you see that they address a range of topics, including shipping time, product quality, packaging, prices, and, yes, customer service. Then, you have those that reflect a mistaken idea about the nature or type of product they expected to receive, perhaps because they didn’t read the description carefully.

The point is this: the stars are not specific to customer service, so using the star ratings in that way is not valid. The data simply cannot be used in that way.

On the other hand, the star ratings could, reasonably, be divided into two groups, and likely produce general impressions: greater than 3 stars and fewer than 3 stars, with 3 stars having no value. In this way, the findings would approximate customer overall satisfaction more closely, and some of the individual difference of definition would be averaged out by this general grouping.

Even better would be a sliding scale in which 5 stars is worth a certain amount of points, 4 stars fewer points, etc., down to 1 star. With a bit of simple math, the total points could be a percentage of 100 points. This is a very common process in evaluation.

The final point about using ratings was addressed previously: quantity. Without a sufficient quantity of ratings, any anomalous rating would skew the results significantly, whereas with a large quantity of ratings, a “one-off” rating would have little effect on the result.

The way the stars are being used now is nonsense, given that a 4-star rating has the same deleterious effect on the analysis result as a 1-star rating.
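The difference between a sliding scale and a five-star-only count can be sketched as follows. This is a minimal illustration under stated assumptions: the point values (100 for 5 stars down to 0 for 1 star, in equal steps) and both function names are hypothetical, and `five_star_share` is only one plausible reading of a measure in which a 4-star rating counts the same as a 1-star rating.

```python
# Hypothetical point values for a sliding scale (an assumption for illustration).
POINTS = {5: 100, 4: 75, 3: 50, 2: 25, 1: 0}


def sliding_scale_score(ratings):
    """Average points per rating, on a 0-100 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(POINTS[r] for r in ratings) / len(ratings)


def five_star_share(ratings):
    """Percentage of ratings that are exactly 5 stars."""
    if not ratings:
        raise ValueError("need at least one rating")
    return 100.0 * sum(1 for r in ratings if r == 5) / len(ratings)


nine_fives_one_four = [5] * 9 + [4]
nine_fives_one_one = [5] * 9 + [1]

for label, ratings in [("one 4-star", nine_fives_one_four),
                       ("one 1-star", nine_fives_one_one)]:
    print(f"{label}: five-star share = {five_star_share(ratings):.1f}%, "
          f"sliding scale = {sliding_scale_score(ratings):.1f}")
```

Under the five-star-only count, both shops land on the same number; under the sliding scale, the shop with a single 4-star rating scores noticeably higher than the shop with a single 1-star rating.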

Sales quantity and value

Actually, Etsy using sales quantity makes a lot of sense. Without a sufficient number of sales, you can’t make any general conclusions about a seller’s service. (Note, it is unfortunate that this same concept isn’t being applied to message and star ratings.) Etsy hasn’t executed this measure well, but the measure, itself, makes sense from an evaluative perspective. Without enough data points, you can’t draw any reasonable conclusions.

On the other hand, a minimum of 10 sales seems arbitrary and is likely too low to reach any reliable conclusions. Twenty, thirty, or more would be more reasonable, but that would prevent many, many good sellers from ever reaching the top ratings needed for the badge. With this low quantity, the measure simply isn’t reliable.

Additionally, for some sellers, 10 sales may be nearly impossible to achieve. This is especially true for those who make custom, high-end, labor-intensive, long-term products, such as furniture or custom wedding dresses.

Sales value, in contrast, doesn’t make sense given that the range of product prices has a very large standard deviation – the range is just too wide. A seller with a $1,000 product will meet the value criterion in one sale. The same seller may never reach the quantity metric because the items simply take too long to make. A seller of low-cost items, say around $1.99 digital products, would need more than 150 sales to reach the necessary level.
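The arithmetic behind these examples can be sketched quickly. The $300 target used below is an assumption inferred from the figures in this post (more than 150 sales at $1.99), not a number stated here, and the function name `sales_needed` is hypothetical. Prices are kept in whole cents to avoid floating-point rounding.

```python
def sales_needed(price_cents: int, target_cents: int) -> int:
    """Smallest number of sales whose total value reaches the target."""
    if price_cents <= 0:
        raise ValueError("price must be positive")
    return -(-target_cents // price_cents)  # ceiling division on integers


TARGET_CENTS = 300_00  # assumed sales-value target of $300, for illustration

for price_cents in (1000_00, 10_00, 1_99):
    count = sales_needed(price_cents, TARGET_CENTS)
    print(f"${price_cents / 100:,.2f} items: {count} sale(s) to reach the target")
```

Under that assumed target, a $1,000 item reaches the value level in 1 sale, a $10 item in 30 sales, and a $1.99 item in 151 sales.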

This measure is also based on the faulty assumption that sellers who make more money are somehow better sellers–without taking into account the value of the products sold. A seller with 30 $10 sales isn’t necessarily more worthy of a badge than a seller with 149 $1.99 sales.

Overall, neither quantity nor aggregate value can work as measures for customer service. Between them, they may prevent great sellers of low-cost items and great sellers of high-value items from ever obtaining the badge.

As a side note: sellers of trademark infringing items in the $10 – $15 range will likely reach both the quantity and value criteria rather easily and quickly.

And the fact is that neither one of these measures directly reflects customer service or, even, product quality. They are unreliable and invalid measures.

Bottom line

The manner in which this entire system is being implemented is faulty: both invalid and unreliable, as shown above.

Now, I have to consider the value of the entire idea. From an evaluation perspective, it attempts to replicate a measure that is already in place: the star rating and review system. It offers a judgment on sellers that provides less information, and less usable information, than the current system.

It will also lead to misunderstanding because some shops (most?) will not receive the badge, and shoppers will not be told why. For example, a shop might not receive the badge because the seller doesn’t respond to messages quickly. A particular buyer might not care about that measure because he or she won’t be sending any messages.

It can also lead to a misconception that the seller is less trustworthy than a seller with a badge, even if the non-badge seller generally provides great service. This misconception can significantly hurt a great seller, simply because the seller missed a criterion by an insignificant amount, say 94% on some measure rather than 95%. The only thing a shopper knows is that the shop doesn’t have a badge.

Even if the above-mentioned problems with reliability and validity are addressed, this system will hurt many great sellers, while providing potential benefits to sellers who violate Etsy policies. A single seller with trademark infringing products who receives a badge will invalidate the value and integrity of the entire system.

One big “elephant in the room” is Etsy taking on the role of business manager for individual sellers. I have to question why Etsy believes it has the responsibility to tell sellers how to conduct their business. The marketplace already rewards good sellers and punishes bad sellers through repeat buyers and the star rating and review system.

Before Etsy considers it has the knowledge, expertise, and responsibility to micro-manage individual sellers, it first must fix problems with the services it sells: faulty search, poor responsiveness to its customers, troublesome shipping products, and the rampant reselling, counterfeiting, and trademark infringements that are coming to characterize the entire platform. Until Etsy improves its own services, Etsy does not have the moral authority to tell others how to treat their customers.

Finally, I expect that this system will eventually affect search ranking and be a search filter. Yet, to judge sellers based on invalid and unreliable processes, and then to hurt their sales potential based on the results, is both unconscionable and unethical.

In the medium to long term, good shops will close, and shops violating Etsy policies will benefit.

Given the numerous significant problems to be addressed on the Etsy platform, time and effort would be better spent improving the platform rather than finding new ways to limit sales from passionate, responsible, and customer-focused sellers.

My conclusion: the entire system needs to be canceled. Not fixed, improved, or tweaked–but canceled.

That’s my two cents. I would love to hear your thoughts on the way sellers are being judged in the Star Seller program.

David Bowman


David Bowman is an award-winning jeweler. He learned silversmithing on the Navajo Reservation and, in 2006, turned his attention to chainmail. David is also the owner and chief editor of Precise Edit. He serves as a communications consultant, highly rated University of New Mexico writing instructor, grant writer, and program consultant. He is the author of 6 writing guides.
