web-Amazon

Date: 2019-12-01

Tags: web, amazon  Category: HTML

1. Preparing the experimental data

1.1. Downloading the data

wget http://snap.stanford.edu/data/amazon/all.txt.gz

1.2. Data analysis

1.2.1. Data format

product/productId: B00006HAXW
product/title: Rock Rhythm & Doo Wop: Greatest Early Rock
product/price: unknown
review/userId: A1RSDE90N6RSZF
review/profileName: Joseph M. Kotow
review/helpfulness: 9/9
review/score: 5.0
review/time: 1042502400
review/summary: Pittsburgh - Home of the OLDIES
review/text: I have all of the doo wop DVD's and this one is as good or better than the
1st ones. Remember once these performers are gone, we'll never get to see them again.
Rhino did an excellent job and if you like or love doo wop and Rock n Roll you'll LOVE
this DVD !!

where:

product/productId: asin, e.g. amazon.com/dp/B00006HAXW  # The Amazon Standard Identification Number (ASIN, the productId here) is a unique identifier of ten characters (letters or digits). It is assigned by Amazon and its partners and is used to identify products on Amazon.
product/title: title of the product
product/price: price of the product
review/userId: id of the user, e.g. A1RSDE90N6RSZF
review/profileName: name of the user
review/helpfulness: fraction of users who found the review helpful
review/score: rating of the product
review/time: time of the review (unix time)
review/summary: review summary
review/text: text of the review
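
Every value in a record arrives as a string, so the numeric fields need converting before analysis. The following is a minimal sketch using the values from the sample record above. The helper names `parse_helpfulness`, `parse_price`, and `parse_time` are hypothetical, and mapping an `unknown` price to `None` is an assumption made for illustration.

```python
from datetime import datetime, timezone

def parse_helpfulness(value):
    """Turn a string such as '9/9' into (helpful_votes, total_votes)."""
    helpful, total = value.split('/')
    return int(helpful), int(total)

def parse_price(value):
    """Return the price as a float, or None when it is 'unknown' (assumed convention)."""
    return None if value == 'unknown' else float(value)

def parse_time(value):
    """Convert a unix timestamp string into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)

# Values copied from the sample record shown earlier
sample = {
    'product/price': 'unknown',
    'review/helpfulness': '9/9',
    'review/score': '5.0',
    'review/time': '1042502400',
}

helpful, total = parse_helpfulness(sample['review/helpfulness'])
price = parse_price(sample['product/price'])
score = float(sample['review/score'])
when = parse_time(sample['review/time'])
print(helpful, total, price, score, when.date())
```

Keeping the two vote counts separate, rather than dividing them right away, avoids a division by zero for reviews that nobody has voted on.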

1.2.2. Data format conversion

First, we need to convert the raw data format into a dictionary.

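import pandas as pd
import numpy as np
import datetime
import gzip
import json
from sklearn.decomposition import PCA
from myria import *
import simplejson

def parse(filename):
    # Open in text mode so that each line is a str rather than bytes (Python 3)
    f = gzip.open(filename, 'rt')
    entry = {}
    for l in f:
        l = l.strip()
        colonPos = l.find(':')
        if colonPos == -1:
            # A line without a colon (the blank separator) ends the current record
            yield entry
            entry = {}
            continue
        eName = l[:colonPos]     # field name, e.g. product/productId
        rest = l[colonPos+2:]    # field value, skipping the ': ' separator
        entry[eName] = rest
    yield entry


# Write the string form of each record to a new gzip file, one record per line
f = gzip.open('somefile.gz', 'wt')
#review_data = parse('kcore_5.json.gz')
for e in parse("kcore_5.json.gz"):
    f.write(str(e) + '\n')
f.close()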

Running the .py file raises an error: string indices must be integers

Cause analysis:

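When data = {"a": "123", "b": "456"} is written directly in the .py file, data is a dict.

But when the .py file obtains the parameter {"a": "123", "b": "456"} passed in from GP through data = arcpy.GetParameter(0), data is a string!

So any later use of data['a'] in the .py file raises the error above.

Solution:

data = arcpy.GetParameter(0)

data = json.loads(data)  # parse the JSON string into a dict

or

data = eval(data)  # in this program we use eval() to convert the string into a dict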
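
The type mismatch described above can be reproduced without arcpy. The sketch below uses a plain string as a stand-in for the geoprocessing parameter (an assumption for illustration) and shows two ways to turn it into a dict. `ast.literal_eval` is offered here as a stricter alternative to `eval()`: it accepts only Python literals, so it cannot execute arbitrary code hidden in the input.

```python
import ast
import json

# Stand-in for the value returned by arcpy.GetParameter(0), assumed to be a str
raw = '{"a": "123", "b": "456"}'

try:
    raw['a']                      # indexing a str with a str key fails
except TypeError as err:
    error_message = str(err)      # the "string indices must be integers" error
    print('TypeError:', error_message)

from_json = json.loads(raw)            # works when the text is valid JSON
from_literal = ast.literal_eval(raw)   # works for Python literals, including single-quoted dicts

print(type(from_json).__name__, from_json['a'])
print(from_literal == from_json)
```

`json.loads` rejects single-quoted keys, so `ast.literal_eval` is the closer replacement when the input is a Python dict literal rather than strict JSON.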

2. Data preprocessing

Approach:

# Import libraries

# Helper functions

# Prepare the review data for training and testing the algorithms

# Preprocess product data for Content-based Recommender System

# Upload the data to the MySQL Database on an Amazon Web Services (AWS) EC2 instance

2.1. Building the DataFrame

def parse(path):
    # Evaluate each line of the gzip file into a dict
    f = gzip.open(path, 'rt')
    for l in f:
        yield eval(l)

review_data = parse('/kcore_5.json.gz')
productID = []
userID = []
score = []
reviewTime = []
rowCount = 0

# Collect the four fields of interest from every review
for entry in review_data:
    productID.append(entry['asin'])
    userID.append(entry['reviewerID'])
    score.append(entry['overall'])
    reviewTime.append(entry['reviewTime'])
    rowCount += 1
    if rowCount % 1000000 == 0:
        print('Already read %s observations' % rowCount)

# Build the DataFrame once all rows are read, then save it as CSV
print('Read %s observations in total' % rowCount)
entry_list = pd.DataFrame({'productID': productID,
                           'userID': userID,
                           'score': score,
                           'reviewTime': reviewTime})
filename = 'review_data.csv'
entry_list.to_csv(filename, index=False)
print('Save the data in the file %s' % filename)

entry_list = pd.read_csv('review_data.csv')

2.2. Data filtering

def filterReviewsByField(reviews, field, minNumReviews):
    # Keep only the rows whose `field` value occurs at least minNumReviews times
    reviewsCountByField = reviews.groupby(field).size()
    fieldIDWithNumReviewsPlus = reviewsCountByField[reviewsCountByField >= minNumReviews].index
    #print('The number of qualified %s: ' % field, fieldIDWithNumReviewsPlus.shape[0])
    if len(fieldIDWithNumReviewsPlus) == 0:
        print('The filtered reviews have become empty')
        return None
    else:
        return reviews[reviews[field].isin(fieldIDWithNumReviewsPlus)]

def checkField(reviews, field, minNumReviews):
    # True when every value of `field` has at least minNumReviews reviews
    return np.mean(reviews.groupby(field).size() >= minNumReviews) == 1

def filterReviews(reviews, minItemNumReviews, minUserNumReviews):
    # Alternate item and user filtering until both thresholds hold at once
    filteredReviews = filterReviewsByField(reviews, 'productID', minItemNumReviews)
    if filteredReviews is None:
        return None
    if checkField(filteredReviews, 'userID', minUserNumReviews):
        return filteredReviews

    filteredReviews = filterReviewsByField(filteredReviews, 'userID', minUserNumReviews)
    if filteredReviews is None:
        return None
    if checkField(filteredReviews, 'productID', minItemNumReviews):
        return filteredReviews
    else:
        return filterReviews(filteredReviews, minItemNumReviews, minUserNumReviews)

def filteredReviewsInfo(reviews, minItemNumReviews, minUserNumReviews):
    t1 = datetime.datetime.now()
    filteredReviews = filterReviews(reviews, minItemNumReviews, minUserNumReviews)
    print('Minimum num of reviews in each item: ', minItemNumReviews)
    print('Minimum num of reviews in each user: ', minUserNumReviews)
    if filteredReviews is None:
        # Nothing survived the filter, so there are no unique counts to report
        print('Dimension of filteredReviews: ', '(0, 4)')
    else:
        print('Dimension of filteredReviews: ', filteredReviews.shape)
        print('Num of unique Users: ', filteredReviews['userID'].unique().shape[0])
        print('Num of unique Product: ', filteredReviews['productID'].unique().shape[0])
    t2 = datetime.datetime.now()
    print('Time elapsed: ', t2 - t1)
    return filteredReviews

allReviewData = filteredReviewsInfo(entry_list, 100, 10)
smallReviewData = filteredReviewsInfo(allReviewData, 150, 15)
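
The recursion in `filterReviews` is needed because removing sparse users can push an item back under its threshold, and the other way round. The sketch below restates the same idea as a loop on a toy DataFrame. The names `keep_frequent` and `k_core_filter` and the toy data are made up for illustration, and unlike the code above this version returns an empty DataFrame rather than `None` when nothing survives.

```python
import pandas as pd

def keep_frequent(df, field, min_count):
    """Keep rows whose `field` value occurs at least min_count times."""
    counts = df[field].value_counts()
    return df[df[field].isin(counts[counts >= min_count].index)]

def k_core_filter(reviews, min_item_reviews, min_user_reviews):
    """Repeat the item and user filters until a pass removes no rows."""
    current = reviews
    while True:
        kept = keep_frequent(current, 'productID', min_item_reviews)
        kept = keep_frequent(kept, 'userID', min_user_reviews)
        if len(kept) == len(current):   # fixed point: both thresholds hold
            return kept
        current = kept

# Toy data: dropping p4 removes u3, which in turn leaves p3 with one review
toy = pd.DataFrame({
    'userID':    ['u1', 'u1', 'u2', 'u2', 'u3', 'u3', 'u1'],
    'productID': ['p1', 'p2', 'p1', 'p2', 'p3', 'p4', 'p3'],
})

core = k_core_filter(toy, min_item_reviews=2, min_user_reviews=2)
print(sorted(core['userID'].unique()), sorted(core['productID'].unique()))
```

A single pass over this toy data would still keep `p3`, which is why the filter has to run until the row count stops changing.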

Theory

1. Combining predictions for accurate recommender systems

So, for practical applications we recommend to use a neural network in combination with bagging due to the fast prediction speed.

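Collaborative filtering (selecting similar recommendations): the main algorithm of e-commerce recommender systems. It uses the preferences of a group of people with similar interests and shared experience to recommend information the user is likely to be interested in.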
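
As an illustration of the idea, here is a minimal user-based collaborative filtering sketch with NumPy. The rating matrix is made up, 0 marks a missing rating, and cosine similarity with a similarity-weighted average is one common choice, not necessarily the method used in the paper cited above. As a simplification, missing ratings count as zeros when the similarity is computed.

```python
import numpy as np

# Rows are users, columns are items; 0 means "not rated" (toy data)
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 4.0, 1.0],
    [1.0, 1.0, 2.0, 5.0],
])

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weights, values = [], []
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue  # skip the target user and users who have not rated the item
        weights.append(cosine_similarity(ratings[user], ratings[other]))
        values.append(ratings[other, item])
    if not weights or sum(weights) == 0:
        return None  # nobody comparable has rated this item
    return float(np.dot(weights, values) / sum(weights))

# User 0 has not rated item 2; user 1 (similar tastes) gave it 4, user 2 gave it 2
prediction = predict(ratings, user=0, item=2)
print(round(prediction, 2))
```

Because user 0 agrees far more with user 1 than with user 2, the prediction lands closer to user 1's rating.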
