[Week 4] - word count with Spark

it's me/👩‍💻 Project Notes 2020. 2. 1. 02:39

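This week, week 4 of the server development camp, we started the team project in earnest.

My Week 4 milestone: analyze words with a word count in Spark + design the API
- Study chapters 2 and 3 of the Spark book
- Study chapters 12 and 13 of the Spark book
- Study chapter 21 of the Spark book (stream processing)
- Receive data in real time and run a word count on it

Apart from the API design, I got through all of it !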

⭐️ This week's summary

Spark - word count

Spark Streaming is a Spark submodule that processes data at intervals shorter than ordinary batch processing (micro-batches).

Out of the box, it can receive data from sources such as:

Kafka
Flume
Kinesis
TCP sockets
File systems, e.g. HDFS, S3 ...

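I haven't connected Kafka yet, so I used TCP socket communication to feed the raw data received from the Twitter API into Spark.

~~Honestly, this was my first time doing socket communication, so I floundered a lot right from this step....~~

Socket communication flow

Quick socket summary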

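Server

socket.socket() - creates the socket object
bind() - associates the socket with a specific network interface (HOST) and port number (PORT)
- HOST can be a hostname, an IP address, or the empty string "".
- An empty string accepts connections on all network interfaces.
- PORT can be any number from 1 to 65535.
listen() - puts the server socket into listening mode so clients can connect
accept() - waits until a client connects, then returns a new socket for that connection
recv() - receives a message sent by the client
send() - sends a message to the client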

Client

socket.socket() - creates a socket object, just as on the server
connect() - connects to the host and port the server socket is bound to
recv() - receives a message from the server
send() - sends a message to the server
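
The server and client steps above can be sketched in one small script. This is only a minimal illustration, not the code I used: the names `run_server` and `demo` are made up for this example, the server runs in a background thread so both sides fit in one file, and port 0 is used so the OS picks a free port.

```python
import socket
import threading


def run_server(server_sock, received):
    """Accept one client, read one message, and reply to it."""
    conn, addr = server_sock.accept()      # blocks until a client connects
    with conn:
        data = conn.recv(1024)             # bytes sent by the client
        received.append(data.decode("utf-8"))
        conn.sendall(b"ack: " + data)      # reply to the client


def demo():
    received = []
    server_sock = socket.socket()          # create the server socket
    server_sock.bind(("127.0.0.1", 0))     # port 0: let the OS pick a free port
    server_sock.listen(1)                  # start listening for connections
    port = server_sock.getsockname()[1]

    t = threading.Thread(target=run_server, args=(server_sock, received))
    t.start()

    client = socket.socket()               # create the client socket
    client.connect(("127.0.0.1", port))    # connect to the server's host and port
    client.sendall("hello".encode("utf-8"))
    reply = client.recv(1024).decode("utf-8")
    client.close()

    t.join()
    server_sock.close()
    return received[0], reply


if __name__ == "__main__":
    print(demo())
```

The accepted connection (`conn`) is a different socket from the listening one, which is why the server keeps `server_sock` only for `accept()` and talks to the client through `conn`.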

Things to watch out for:
- Messages are sent over a socket as bytes❗️ So tweets received from the Twitter API have to be sent with encode('utf-8') and read with decode().
- To send the whole tweet rather than just its text, send raw_tweet = json.dumps( data ).encode('utf-8').
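
As a quick illustration of the two notes above, here is the round trip from str or dict to bytes and back. The `tweet` dictionary is a hypothetical payload made up for this sketch, not the actual shape of a Twitter API response.

```python
import json

# Hypothetical tweet payload, only loosely shaped like a real status object
tweet = {"id": 1, "text": "안녕 #BTS", "lang": "ko"}

# Sending only the text: str -> bytes
text_bytes = tweet["text"].encode("utf-8")

# Sending the whole tweet: dict -> JSON str -> bytes
raw_tweet = json.dumps(tweet).encode("utf-8")

# Receiving side: bytes -> str (-> dict)
decoded_text = text_bytes.decode("utf-8")
decoded_tweet = json.loads(raw_tweet.decode("utf-8"))
```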

    from tweepy import OAuthHandler
    from tweepy import Stream
    from tweepy.streaming import StreamListener  # StreamListener is the tweepy 3.x API
    import socket
    import json

    # Fill in your own Twitter API credentials
    consumer_key = ''
    consumer_secret = ''
    access_token_key = ''
    access_token_secret = ''


    class TweetsListener(StreamListener):
        """Forwards the text of each incoming tweet to the connected socket."""

        def __init__(self, csocket):
            super().__init__()
            self.client_socket = csocket

        def on_data(self, data):
            try:
                raw_tweet = json.loads(data)
                print(raw_tweet['text'])
                # Sockets carry bytes, so encode the text. The trailing newline
                # lets a line-based reader (such as Spark's socketTextStream)
                # tell where one record ends.
                self.client_socket.sendall((raw_tweet['text'] + '\n').encode('utf-8'))
            except Exception as e:
                print("Error on_data: %s" % str(e))
            return True  # keep the stream running even after an error

        def on_error(self, status):
            print(status)
            return True


    def sendData(c_socket):
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token_key, access_token_secret)

        twitter_stream = Stream(auth, TweetsListener(c_socket))  # Create a Stream
        twitter_stream.filter(track=['#BTS'])  # Start streaming tweets matching the filter


    if __name__ == "__main__":
        s = socket.socket()         # Create a socket object
        host = "127.0.0.1"          # Loopback address (local machine only)
        port = 5555                 # Port the server listens on
        s.bind((host, port))        # Bind to the port

        print("Listening on port: %s" % str(port))

        s.listen(5)                 # Now wait for client connection.
        c, addr = s.accept()        # Establish connection with client.

        print("Received request from: " + str(addr))

        sendData(c)

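It turns out the client and server sides can be handled in a single file like this,,

but I had written a separate client socket file and server socket file and run them both.... Still, going through that process taught me exactly how socket handling works !

Connecting to Spark

Tweets received through the Twitter API are sent by the client to the socket server at 127.0.0.1:5555
Import StreamingContext, which is the main entry point for all streaming functionality.

Create a local StreamingContext with two execution threads and a batch interval of 5 seconds.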

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two worker threads and a batch interval of 5 seconds
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 5)

Using this context, create a DStream that represents streaming data from a TCP source, specified as hostname (e.g. localhost) and port (e.g. 5555).

socket_stream = ssc.socketTextStream("127.0.0.1", 5555)

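The lines DStream represents the stream of data that will be received from the data server. Here it is taken over a 20-second window, so each result covers the last four 5-second batches.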

lines = socket_stream.window(20)
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch (here, over the 20-second window)
pairs = words.map(lambda word: (word, 1))
wordCounts = pairs.reduceByKey(lambda x, y: x + y)

# Print the first ten elements of each RDD generated in this DStream to the console
wordCounts.pprint()
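
To make the flatMap / map / reduceByKey chain and the window easier to picture, here is a plain-Python sketch that mimics them without Spark. It is an illustration under assumptions, not how Spark is implemented: `word_count` and `windowed_counts` are hypothetical helpers, each inner list stands for one 5-second micro-batch, and `window_batches` is the window length expressed in batches (a 20-second window over 5-second batches would be 4).

```python
from collections import Counter


def word_count(lines):
    """Plain-Python equivalent of flatMap(split) -> map((w, 1)) -> reduceByKey(+)."""
    words = [word for line in lines for word in line.split(" ")]   # flatMap
    pairs = [(word, 1) for word in words]                          # map
    counts = Counter()
    for word, n in pairs:                                          # reduceByKey
        counts[word] += n
    return dict(counts)


def windowed_counts(batches, window_batches):
    """Count words over the most recent `window_batches` batches, once per batch."""
    results = []
    for i in range(len(batches)):
        start = max(0, i + 1 - window_batches)
        window_lines = [line for batch in batches[start:i + 1] for line in batch]
        results.append(word_count(window_lines))
    return results


# Hypothetical input: one list of lines per micro-batch
batches = [["a b"], ["b c"], ["c c"]]
results = windowed_counts(batches, window_batches=2)
```

Because the window slides forward one batch at a time, a word from an older batch keeps being counted until that batch falls out of the window.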

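In Jupyter Notebook, run steps 2, 3, and 4 (the StreamingContext, DStream, and word count code above), then run the socket script written in Python
Start the word count with ssc.start()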

ssc.start()             # Start the computation
ssc.awaitTermination()  # Wait for the computation to terminate

Result

Most of the data is meaningless,,,, so I plan to strip out everything unnecessary and print the words in order of count.
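
One possible shape for that cleanup, sketched in plain Python so it can be tried without Spark. Everything here is an assumption rather than what I ended up using: `STOPWORDS`, `clean_token`, and `rank_words` are hypothetical names, and the rules (drop URLs, mentions, and a small stop list) are just one guess at what counts as noise. In Spark Streaming the same idea would presumably map onto DStream `filter` plus `transform` with an RDD `sortBy`.

```python
import re
from collections import Counter

# Hypothetical stop list; adjust to whatever counts as noise in your data
STOPWORDS = {"rt", "the", "a", "to", "and"}


def clean_token(token):
    """Lowercase a token and drop URLs, mentions, and surrounding punctuation."""
    token = token.strip().lower()
    if token.startswith("http") or token.startswith("@"):
        return ""
    # Strip leading/trailing characters that are neither word characters nor '#'
    return re.sub(r"^[^\w#]+|[^\w#]+$", "", token)


def rank_words(lines, top_n=10):
    """Return the top_n (word, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        for raw in line.split():
            word = clean_token(raw)
            if word and word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(top_n)


sample = ["RT @user: I love #BTS!", "love love https://t.co/x #bts"]
ranked = rank_words(sample, top_n=2)
```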

References:

Twitter stream processing : https://hero0926.tistory.com/5
Spark Streaming Programming Guide : https://spark.apache.org/docs/latest/streaming-programming-guide.html
