2018 Scrapy Environment Enhancement (2): Proxy to the Tor Network
Follow These Guides to Set Up the Proxy
https://gist.github.com/DusanMadar/8d11026b7ce0bce6a67f7dd87b999f6b
https://stackoverflow.com/questions/45009940/scrapy-with-privoxy-and-tor-how-to-renew-ip
Install Tor and Verify
> apt update
> apt install tor
Install Client Tool
> apt install netcat
Set Up Tor
> echo "ControlPort 9051" >> /etc/tor/torrc
> echo HashedControlPassword $(tor --hash-password "password" | tail -n 1) >> /etc/tor/torrc
Start Tor
> service tor start
Exception:
/etc/init.d/tor: line 140: ulimit: open files: cannot modify limit: Operation not permitted
Solution:
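Not resolved yet. Tor still started in this setup and the checks that follow succeeded, so the warning did not block anything here.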
Verify the Control Port
> echo -e 'AUTHENTICATE "password"' | nc 127.0.0.1 9051
Check the Public IP
> apt install curl
My Local IP
> curl http://icanhazip.com/
xxx.xxx.244.5
My Proxy IP
> torify curl http://icanhazip.com/
185.56.80.242
Change the IP
> echo -e 'AUTHENTICATE "password"\r\nsignal NEWNYM\r\nQUIT' | nc 127.0.0.1 9051
> torify curl http://icanhazip.com/
51.15.86.162
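The netcat one-liner above only writes three text commands to the control port, so the same exchange can be sketched with a plain Python socket. This is a minimal sketch, not part of the referenced guides: the function names are made up for illustration, and it assumes Tor answers each accepted command with a line starting with 250 and closes the connection after QUIT.

```python
import socket


def build_newnym_commands(password):
    """Return the control-port commands the netcat one-liner sends, as bytes."""
    # Escape backslashes and double quotes so the password stays one quoted string.
    escaped = password.replace("\\", "\\\\").replace('"', '\\"')
    lines = ['AUTHENTICATE "{}"'.format(escaped), "SIGNAL NEWNYM", "QUIT"]
    return "".join(line + "\r\n" for line in lines).encode("ascii")


def all_replies_ok(reply):
    """True if every non-empty reply line starts with the 250 success code."""
    lines = [line for line in reply.decode("ascii", "replace").splitlines() if line]
    return bool(lines) and all(line.startswith("250") for line in lines)


def request_new_identity(password, host="127.0.0.1", port=9051, timeout=10):
    """Send the commands to Tor's control port and report whether all succeeded."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_newnym_commands(password))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the connection is closed once QUIT (or a failure) is handled
                break
            chunks.append(data)
    return all_replies_ok(b"".join(chunks))
```

Calling request_new_identity("password") against a running Tor should return True when the password matches the one hashed into torrc.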
Change and Check IP in Python
> pip install stem
> python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> from stem import Signal
>>> from stem.control import Controller
>>> with Controller.from_port(port=9051) as controller:
...     controller.authenticate()
...     controller.signal(Signal.NEWNYM)
...
>>> exit()
> torify curl http://icanhazip.com/
142.4.211.161
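Tor rate-limits NEWNYM, so a signal sent too soon after the previous one may not take effect right away. A possible refinement of the session above, sketched here as an assumption rather than something taken from the referenced guides, is to ask stem how long to wait first. The helper name renew_identity is hypothetical; is_newnym_available() and get_newnym_wait() are stem Controller methods.

```python
import time


def renew_identity(controller, newnym, sleep=time.sleep):
    """Send NEWNYM once Tor is ready for it; return the seconds waited first."""
    waited = 0.0
    if not controller.is_newnym_available():
        # Tor would not act on the signal yet, so wait for the time stem reports.
        waited = controller.get_newnym_wait()
        sleep(waited)
    controller.signal(newnym)
    return waited


def main():
    # Imported here so the helper above can be loaded without stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=9051) as controller:
        # Assumes the control password hashed into torrc is "password".
        controller.authenticate(password="password")
        waited = renew_identity(controller, Signal.NEWNYM)
        print("NEWNYM sent after waiting {:.1f}s".format(waited))
```

Run main() on the machine where Tor is listening on port 9051, then check the address again with torify curl.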
Install Privoxy and Check
> apt install privoxy
Configure to connect to Tor
> echo "forward-socks5t / 127.0.0.1:9050 ." >> /etc/privoxy/config
Start the Service
> service privoxy start
> curl -x 127.0.0.1:8118 http://icanhazip.com/
142.4.211.161
Check Everything in Python 3
> pip install requests
> python
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import requests
>>> from stem import Signal
>>> from stem.control import Controller
>>> response = requests.get('http://icanhazip.com/', proxies={'http': '127.0.0.1:8118'})
>>> response.text.strip()
'142.4.211.161'
>>> with Controller.from_port(port=9051) as controller:
...     controller.authenticate(password='password')
...     controller.signal(Signal.NEWNYM)
...
>>> response = requests.get('http://icanhazip.com/', proxies={'http': '127.0.0.1:8118'})
>>> response.text.strip()
'95.128.43.164'
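The session above checks the address once before and once after NEWNYM. The new circuit may take a moment to be used, and it is not guaranteed to exit through a different address, so a script may want to retry. The following is a minimal sketch under those assumptions; rotate_until_changed, current_ip and the retry defaults are illustrative names and values, not something from the referenced guides.

```python
import time

# Privoxy listening locally, as configured above.
PRIVOXY = {"http": "http://127.0.0.1:8118"}


def current_ip(proxies=PRIVOXY, url="http://icanhazip.com/"):
    """Return the public IP seen through the given proxies."""
    import requests  # imported here so the retry helper below has no dependencies

    return requests.get(url, proxies=proxies, timeout=30).text.strip()


def rotate_until_changed(get_ip, renew, attempts=5, delay=5.0, sleep=time.sleep):
    """Call renew() until get_ip() differs from the starting address.

    Returns (old_ip, new_ip); new_ip is None if the address never changed.
    """
    old_ip = get_ip()
    for _ in range(attempts):
        renew()
        sleep(delay)  # give Tor time to build and use a new circuit
        new_ip = get_ip()
        if new_ip != old_ip:
            return old_ip, new_ip
    return old_ip, None
```

Here get_ip would be current_ip, and renew would be any callable that sends NEWNYM, for example a small function wrapping the Controller block from the session above.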
It also works inside the Scrapy framework, here as a downloader middleware
from scrapy.http import HtmlResponse
from selenium import webdriver


class ChromeHeadlessMiddleware(object):
    def process_request(self, request, spider):
        # Work around "access denied" pages that detect headless Chrome:
        # https://intoli.com/blog/making-chrome-headless-undetectable/
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        options.add_argument('no-sandbox')
        options.add_argument('window-size=800x600')
        options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36')
        # Send browser traffic through Privoxy, which forwards it to Tor
        options.add_argument('--proxy-server=127.0.0.1:8118')
        # Older Selenium releases take chrome_options=options instead
        browser = webdriver.Chrome(options=options)
        browser.switch_to.window(browser.window_handles[0])
        browser.get(request.url)
        body = browser.page_source
        url = browser.current_url
        # Close the browser so each request does not leave a Chrome process behind
        browser.quit()
        return HtmlResponse(url, body=body, encoding='utf-8', request=request)
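For pages that do not need a real browser, a lighter option is to let Scrapy's own downloader use Privoxy by setting the proxy key in request.meta, which Scrapy's HTTP proxy support reads. This is only a sketch: the class name PrivoxyProxyMiddleware, the module path myproject.middlewares and the priority 350 are assumptions for illustration, not taken from the project above.

```python
class PrivoxyProxyMiddleware(object):
    """Hypothetical downloader middleware: send each request through Privoxy."""

    # Privoxy address from the setup above; it forwards the traffic to Tor.
    proxy_url = "http://127.0.0.1:8118"

    def process_request(self, request, spider):
        # Keep a proxy that a spider already chose; otherwise use Privoxy.
        request.meta.setdefault("proxy", self.proxy_url)
        return None  # let Scrapy continue handling the request normally


# settings.py fragment; "myproject.middlewares" is an assumed module path.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.PrivoxyProxyMiddleware": 350,
}
```

Using setdefault means a spider can still override the proxy for a single request by setting request.meta['proxy'] itself.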
References:
http://sillycat.iteye.com/blog/2418229
http://neuralfoundry.com/scrapy-in-a-container-docker-development-environment/
https://github.com/dataisbeautiful/scrapy-development-docker
https://github.com/scrapy-plugins/scrapy-splash
https://www.jianshu.com/p/4052926bc12c
https://www.cnblogs.com/jclian91/p/8590617.html
IP Proxy Setting
https://free-proxy-list.net/
https://github.com/cnu/scrapy-random-useragent
https://github.com/aivarsk/scrapy-proxies
https://gist.github.com/seagatesoft/e7de4e3878035726731d
https://stackoverflow.com/questions/28852057/change-ip-address-dynamically
http://danielphil.github.io/raspberrypi/http/proxy/2015/04/01/raspberry-pi-http-proxy.html
https://docs.proxymesh.com/article/4-python-proxy-configuration
https://gist.github.com/DusanMadar/8d11026b7ce0bce6a67f7dd87b999f6b