Saturday, June 1, 2013

Blexbot Content Scraper is Really Nielsen Media Research

I had great difficulty finding detailed information online about an IP address, 216.176.177.162, that appeared in my site log over ten thousand times. But now that IP address is cold busted. It belongs to Nielsen Media Research, a pack of content scrapers. They do not wish to be identified as such, and so they lie, and call themselves a random name like Blexbot. Tomorrow they will be clexbot, and the day after that, wmu-bot.

What are Content Scrapers? They are greedy bots that attempt to grab every piece of data from a given site. Interesting bits of this data are then grouped together and sold to companies, governments, or individuals. In short, they grab content and try to profit from it. They do not send traffic. They should be banned by every site, no question about it.

Lookie what the scumbags are doing on a WordPress site:
216.176.177.162 - - [29/May/2013:06:21:13 -0800] "GET /password HTTP/1.1" 404 2438 "-" "BLEXBot"

216.176.177.162 - - [29/May/2013:06:21:16 -0800] "GET /signup?context=webintent HTTP/1.1" 404 2438 "-" "BLEXBot"

216.176.177.162 - - [29/May/2013:06:21:18 -0800] "GET /reg/join HTTP/1.1" 404 2401 "-" "BLEXBot"

216.176.177.162 - - [29/May/2013:06:21:21 -0800] "GET /forgot_password HTTP/1.1" 404 2438 "-" "BLEXBot"

They're not just content scrapers, they're malicious hackers. Those 404's you see above? That code means they're making up links as they go along, running them up the flag pole to see if anybody salutes. Meanwhile, the web admin gets to have fun wondering what's wrong with his web site that all of these 404 errors are popping up. (There were many more than just the above examples.)
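For anyone who wants to see how many of these 404s a single visitor is generating, here is a minimal Python sketch. It assumes the log is in the Apache "combined" format shown in the excerpts above; the regular expression and the count_404s helper are my own illustration, not part of any server tooling.

```python
import re
from collections import Counter

# Pattern for the Apache "combined" log format used in the excerpts above.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_404s(lines):
    """Return a Counter mapping (ip, user_agent) to its number of 404 responses."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("status") == "404":
            counts[(match.group("ip"), match.group("agent"))] += 1
    return counts


# The four log lines quoted in this post.
sample = [
    '216.176.177.162 - - [29/May/2013:06:21:13 -0800] "GET /password HTTP/1.1" 404 2438 "-" "BLEXBot"',
    '216.176.177.162 - - [29/May/2013:06:21:16 -0800] "GET /signup?context=webintent HTTP/1.1" 404 2438 "-" "BLEXBot"',
    '216.176.177.162 - - [29/May/2013:06:21:18 -0800] "GET /reg/join HTTP/1.1" 404 2401 "-" "BLEXBot"',
    '216.176.177.162 - - [29/May/2013:06:21:21 -0800] "GET /forgot_password HTTP/1.1" 404 2438 "-" "BLEXBot"',
]

for (ip, agent), hits in count_404s(sample).most_common():
    print(ip, agent, hits)
```

To run it against a real log, pass an open file object to count_404s instead of the sample list. Lines that do not match the pattern are simply skipped.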

34 comments:

Dan Atkinson said...

The code does not mean that they're making up links as they go along. Also, if they were malicious, I'd be expecting them to send POST requests rather than GET requests. These ones look pretty benign and harmless.

Also, these links could easily have been present on the site in question, in which case the bot was just doing what a crawler does: spidering the links.

If you're really worried about being scraped, start by making use of the robots.txt file. If they don't obey the robots.txt file, then you can take further action, such as making an abuse complaint to their host/registrar.

igor said...

Counsel for the defence has spoken.

Whether Blexbot is malicious or buggy may be a matter of opinion. However, there is one irrefutable fact. The links were not "present on the site in question." They were manufactured by Blexbot. Whether that was due to malice or mere incompetence may remain a matter of opinion.

My opinion remains unchanged. You have not told me anything new.

It is not incumbent on any web admin to permit bots any amount of access. If an admin wants to spend all day holding court with bots and presuming innocence before guilt, and wondering why the bots don't always obey robots.txt, he is welcome to do so, but I find more relief from .htaccess. A reader can probably find Blexbot's IP address in my blacklist, if I've updated it.

Consider the scenario of a site on shared hosting with finite bandwidth and cpu. A bot comes along and sucks down ten thousand pages, gigabytes of data, triggering 404 errors along the way that will show up in the admin log. That is a malicious waste of time and resources in itself, whether the bot intends to steal the site content, which is highly likely, or is probing for security vulnerabilities or who knows what else--and who cares.

The bots I like and want are the major search engines and those related to social media and things I have subscribed to and invited to visit my site. Period.

Anonymous said...

I got here after searching about BLEXBot because they were crawling our site and hitting the same URL 20k times or so. They did this on a handful of pages, racking up several hundred thousand page views yesterday alone.

They haven't pulled the robots.txt file either.

It's a broken crawler attempting to do who knows what. We blocked it. If indeed it is Nielsen Media Research they should identify their bot as such so we could actually contact them about their issue prior to blocking them, but they obviously don't care.

igor said...

I am for efficiency and for getting my money's worth out of my hosting plan. Without using a blacklist, the typical site is going to see up to 90% of their bandwidth and cpu wasted on greedy, selfish and poorly programmed bots.

Anonymous said...

Came across this via google for BLEXBot useragent, thanks for the info!

Over the last few days we've been getting scraped by a bot identifying itself as BLEXBot, which requests the same invalid URL (one that definitely wasn't linked anywhere on our site) around 8000 times in the early hours of the morning.

Incidentally, for us, requests were coming from 198.143.187.114. Looks like they are getting some new IPs to blacklist via SingleHop servers.

igor said...

Thanks for the info, but that IP is already in my blacklist. I deny from 198.143.128.0/18.

I'll update the copy of my blacklist here on the blog one of these days. You are welcome to use it.

Anonymous said...

BLEXBot created 20.823 pageviews in 2 days.
You can add 198.143.158.178 to your blacklist.

igor said...

Thank you for the suggestion. I added 198.143.128.0/18 just in case they have rented multiple IPs in the same range.
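A quick way to check whether a reported address already falls inside a blocked range is Python's standard ipaddress module. This is only a sketch: the is_blocked helper is a hypothetical name of mine, and the addresses tested are the ones reported in this post and the comments above.

```python
import ipaddress

# The range igor denies in his blacklist.
BLOCKED = ipaddress.ip_network("198.143.128.0/18")


def is_blocked(ip, network=BLOCKED):
    """Return True if the dotted-quad address lies inside the network."""
    return ipaddress.ip_address(ip) in network


# First and last address covered by the /18.
print(BLOCKED[0], BLOCKED[-1])

for ip in ("198.143.187.114", "198.143.158.178", "216.176.177.162"):
    print(ip, is_blocked(ip))
```

The /18 spans 198.143.128.0 through 198.143.191.255, so both 198.143.x.x addresses reported above are covered, while the original 216.176.177.162 is not and needs its own entry.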

iJoost said...

Fetches more pages than Google and Bing together. At a rate of one per second. Waste of resources...
*PLOINK*

Anonymous said...

we're getting 23k hits a day from 198.143.187.114. Bye bye blexbot

Unknown said...

I just encountered blexbot for the first time today and it did 32k pageviews in 24 hrs. I think it's time to block the user agent and ban the IP range.

Anonymous said...

This bot hit one of my picture galleries hard over the past week. Not just repeatedly but enough times simultaneously that it started causing SQL errors and my host shut down my sites to prevent overload. The bot is not only buggy but very destructive and disruptive. Thanks for the heads up!

Known IP: 198.143.158.178

Anonymous said...

Yes, this is a bad bot. Hammered the hell out of my server. I blocked them as well. 198.143.158.34

I'm guessing we can block 198.143.158.0/24 with no adverse effects judging from anon's post above...

Anonymous said...

Yep. Used 400k of bandwidth in one day on a site that usually sees 100k a month!
198.143.158.34 this time.

Anonymous said...

They brought down one of my webservers for about 5 minutes with their ruthless high speed scanning.

I don't play nice with hosting companies that allow this kind of continued action. IPTABLE Blocked:
108.178.0.0/16
198.143.0.0/16

igor said...

Where did you find this information? My information is different.

Unknown said...

Over the last 3 days 198.143.158.34 has used almost 1 gig of bandwidth trying to scrape my site.

198.143.158.34 - - [10/Aug/2013:14:59:52 +0000] "GET /f/ucp.php?mode=login&sid=383c0a61af6fb8929ddcd40cd33ad997 HTTP/1.0" 404 1224 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup.com/crawler.html)"

Anonymous said...

108.178.61.130

3.3k hits the other day

Unknown said...

Same problem here:

198.143.187.122 - - [20/Aug/2013:08:37:56 -0300] "GET /customer/account/forgotpassword/ HTTP/1.1" 200 6842 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup.com/crawler.html)"

Anonymous said...

Indeed, 824MB of bandwidth has been sucked out of my site in a 48 hour span by BLEXBot from 108.178.53.146 and it may not be a coincidence that 85.25.134.59 has pretended to visit my site during this same time period. That is more bandwidth in 48 hours than the combined total bandwidth used in the previous year. Gee thanks, right?!?

Backtracking that last IP from my logs tells me that MacInroy Privacy Auditors pretty much admits that they have attempted to compromise my site security with the sole purpose of getting the attention of an admin so they could pitch some security services. Most of the information they claim to have audited is completely incorrect and the rest came from whois data.

Cheers, igor.

Anonymous said...

There is also the fact that I have a number of failed WP login attempts from a BLEXbot identified IP. Sure they're not posting on those pages, but they certainly ARE POSTing to my WP install, obviously hoping for the best. I have had 12 bad login attempts from this IP in the past hour.

Anonymous said...

It's an SEO tool from WebMeUp.

Anonymous said...

If you think writing a robots.txt file and hoping for the best is enough to keep bad bots at bay, good luck with defending your good web contents from those evil, rampant scrapers!

Anonymous said...

It's not an SEO tool! It's scraping for information!

How do I know this? It's using our own search function to search for products on our eCommerce store. I'd never have noticed if our custom search log wasn't flooded with Barcode numbers for our products!

BAN BAN BAN! They'll mess up all your analytical statistics and drain your bandwidth!

igor said...

I have felt similar annoyance, and that is why I crafted my blacklist of IP addresses over many years. It is highly effective, when one has ftp access, but blogspot doesn't permit such things.

Anonymous said...

it keeps coming. just banned 198.143.158.202

Does anyone have a list of IP blocks for a mass block of SingleHop servers?

Anonymous said...

198.143.128.0/18 takes out a whole bunch of singlehop in one shot.

igor said...

It's amazing what a good blacklist can do. Almost no need for captcha anymore.

pensiunan bpkp said...

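Just wrote this in .htaccess:

# Return 403 Forbidden to any client whose User-Agent contains "BLEXBot"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BLEXBot [NC]
RewriteRule .* - [F]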

problem solved :D

Russell said...

Hello Igor,

Thanks for the heads-up on this pesky crawler.

I had 992 "visits" from this thing before I noticed it and killed it.

All of the hits came from:

136.243.36.95

... which is part of the scraper/spammer heaven that your-server.de has become.

Thanks again for all the helpful info. I spent the last 2 hours reading some of your recent articles, and have subscribed to future updates.

igor said...

I have updated my .htaccess blacklist. You are free to use it.

Designer said...

Yeah, an .htaccess blacklist is the correct approach. WebMeUp said "only 1 request per 3 seconds", so 100 bots will make "only" 100 requests per 3 seconds. And WebMeUp called themselves "gentle"...
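To put that rate into daily numbers, here is a back-of-the-envelope sketch in Python. The figure of one request per 3 seconds is the one quoted above; the count of 100 simultaneous bots is a hypothetical, and requests_per_day is just an illustrative helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def requests_per_day(crawlers, seconds_per_request=3):
    """Requests in 24 hours if each crawler sends one request every few seconds."""
    return crawlers * SECONDS_PER_DAY // seconds_per_request


print(requests_per_day(1))    # 28800
print(requests_per_day(100))  # 2880000
```

Even a single crawler running at the stated limit around the clock would make 28,800 requests a day, which is the same order of magnitude as the daily totals several commenters report above.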

Anonymous said...

The data collected by BLEXBOT is what powers SEO PowerSuite. We block their ip ranges outright for their ridiculous crawl behavior. In the past 9 months they have crawled us as follows:

IP              REQUESTS
144.76.198.139         3
144.76.219.100      2031
148.251.0.23         763
148.251.10.183    124620
148.251.15.150     68748
148.251.21.227     18315
136.243.36.80        355
136.243.36.89      60902
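When deciding which ranges to block, it can help to total a table like this by network. The sketch below groups the counts by /16 using Python's ipaddress module. The /16 width is only an assumption for illustration; the real allocation should be checked with whois before blocking anything that wide. The totals_by_prefix helper is a hypothetical name.

```python
import ipaddress
from collections import Counter

# Request counts copied from the table above.
requests_by_ip = {
    "144.76.198.139": 3,
    "144.76.219.100": 2031,
    "148.251.0.23": 763,
    "148.251.10.183": 124620,
    "148.251.15.150": 68748,
    "148.251.21.227": 18315,
    "136.243.36.80": 355,
    "136.243.36.89": 60902,
}


def totals_by_prefix(counts, prefix_len=16):
    """Sum request counts per enclosing network of the given prefix length."""
    totals = Counter()
    for ip, hits in counts.items():
        # strict=False lets us pass a host address and get its network back.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        totals[str(network)] += hits
    return totals


totals = totals_by_prefix(requests_by_ip)
for network, hits in totals.most_common():
    print(network, hits)
print("all", sum(totals.values()))
```

On these figures the 148.251.x.x addresses account for 212,446 of the 275,737 requests, so most of the load comes from one neighborhood.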

techlorebyigor is my personal journal for ideas & opinions