
Commit

Change the data storage directory to `data`: more direct
jiangzhonglian committed Feb 15, 2019
1 parent 23a80da commit 7ec9ddb
Showing 69 changed files with 193 additions and 193 deletions.
2 changes: 1 addition & 1 deletion blog/ml/13.利用PCA来简化数据.md
@@ -115,7 +115,7 @@
```python
def replaceNanWithMean():
-datMat = loadDataSet('db/13.PCA/secom.data', ' ')
+datMat = loadDataSet('data/13.PCA/secom.data', ' ')
numFeat = shape(datMat)[1]
for i in range(numFeat):
# Take the mean over the values that are not NaN
2 changes: 1 addition & 1 deletion blog/ml/14.利用SVD简化数据.md
@@ -402,7 +402,7 @@ def imgCompress(numSV=3, thresh=0.8):
thresh: decision threshold
"""
# Build a list
-myMat = imgLoadData('db/14.SVD/0_5.txt')
+myMat = imgLoadData('data/14.SVD/0_5.txt')

print "****original matrix****"
# Run SVD on the original image and reconstruct it
16 changes: 8 additions & 8 deletions blog/ml/15.大数据与MapReduce.md
@@ -64,15 +64,15 @@ cat inputFile.txt | python mapper.py | sort | python reducer.py > outputFile.txt
```
# Test the Mapper
# Linux
-cat db/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanMapper.py
+cat data/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanMapper.py
# Windows
-# python src/python/15.BigData_MapReduce/mrMeanMapper.py < db/15.BigData_MapReduce/inputFile.txt
+# python src/python/15.BigData_MapReduce/mrMeanMapper.py < data/15.BigData_MapReduce/inputFile.txt
# Test the Reducer
# Linux
-cat db/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanMapper.py | python src/python/15.BigData_MapReduce/mrMeanReducer.py
+cat data/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanMapper.py | python src/python/15.BigData_MapReduce/mrMeanReducer.py
# Windows
-# python src/python/15.BigData_MapReduce/mrMeanMapper.py < db/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanReducer.py
+# python src/python/15.BigData_MapReduce/mrMeanMapper.py < data/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanReducer.py
```
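
For orientation, the two scripts piped above follow the usual Hadoop-streaming shape: the mapper turns raw input lines into partial statistics on stdout, and the reducer merges those partials from stdin. A minimal sketch under the assumption of one float per input line (not necessarily identical to the repository's mrMeanMapper.py/mrMeanReducer.py):

```python
import sys
import numpy as np

def mapper():
    # Emit one record per input split: count, mean, mean of squares
    values = np.array([float(line) for line in sys.stdin if line.strip()])
    print('%d\t%f\t%f' % (len(values), values.mean(), (values ** 2).mean()))

def reducer():
    # Merge the partial (count, mean, mean-of-squares) records into global stats
    n = cum_val = cum_sq = 0.0
    for line in sys.stdin:
        ni, mean_i, sq_i = map(float, line.split('\t'))
        n += ni
        cum_val += ni * mean_i
        cum_sq += ni * sq_i
    print('%d\t%f\t%f' % (int(n), cum_val / n, cum_sq / n))

if __name__ == '__main__':
    # In a real streaming job each role lives in its own script;
    # both are shown here only for brevity
    mapper()
```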

### Machine learning with MapReduce
@@ -93,17 +93,17 @@ cat db/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapRedu
* mrjob is a nice learning tool; it was open-sourced at the end of 2010 and comes from Yelp (a restaurant review site).

```Shell
-python src/python/15.BigData_MapReduce/mrMean.py < db/15.BigData_MapReduce/inputFile.txt > db/15.BigData_MapReduce/myOut.txt
+python src/python/15.BigData_MapReduce/mrMean.py < data/15.BigData_MapReduce/inputFile.txt > data/15.BigData_MapReduce/myOut.txt
```

> Hands-on script
```
# Test the mrjob example
# First, test just the mapper
-# python src/python/15.BigData_MapReduce/mrMean.py --mapper < db/15.BigData_MapReduce/inputFile.txt
+# python src/python/15.BigData_MapReduce/mrMean.py --mapper < data/15.BigData_MapReduce/inputFile.txt
# To run the whole job, just remove --mapper
-python src/python/15.BigData_MapReduce/mrMean.py < db/15.BigData_MapReduce/inputFile.txt
+python src/python/15.BigData_MapReduce/mrMean.py < data/15.BigData_MapReduce/inputFile.txt
```
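
mrMean.py itself is an mrjob job class. As a rough sketch of the shape such a script takes, assuming the mrjob package, and simplified relative to the repository's version (which also streams partial sums through a mapper_final step):

```python
from mrjob.job import MRJob

class MRMean(MRJob):
    """Average all input values with a single reducer."""

    def mapper(self, _, line):
        # Route every value to one key so a single reducer sees them all
        if line.strip():
            yield 'mean', float(line.split('\t')[0])

    def reducer(self, key, values):
        vals = list(values)
        yield key, sum(vals) / len(vals)

if __name__ == '__main__':
    MRMean.run()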

### Project case: the Pegasos algorithm for a distributed SVM
@@ -213,7 +213,7 @@ def batchPegasos(dataSet, labels, lam, T, k):

[Full code](https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/pegasos.py): <https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/pegasos.py>

-How to run: `python /opt/git/MachineLearning/src/python/15.BigData_MapReduce/mrSVM.py < db/15.BigData_MapReduce/inputFile.txt`
+How to run: `python /opt/git/MachineLearning/src/python/15.BigData_MapReduce/mrSVM.py < data/15.BigData_MapReduce/inputFile.txt`
[MapReduce version of the code](https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/mrSVM.py): <https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/mrSVM.py>
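
The hunk header above points at batchPegasos(dataSet, labels, lam, T, k). For reference, Pegasos does stochastic sub-gradient descent on the hinge loss: at step t it samples a batch of k examples, keeps the margin violators (y_i * w.x_i < 1), and updates w <- (1 - 1/t) * w + (1 / (lam * t * k)) * sum(y_i * x_i). A NumPy sketch under those assumptions, not the repository's exact code:

```python
import numpy as np

def batch_pegasos(data, labels, lam, T, k):
    # data: (m, n) array, labels: (m,) array of +1/-1,
    # lam: regularization strength, T: iterations, k: batch size
    m, n = data.shape
    w = np.zeros(n)
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)                      # decaying step size
        batch = np.random.choice(m, k, replace=False)
        margins = labels[batch] * (data[batch] @ w)
        viol = batch[margins < 1.0]                # margin violators only
        grad = lam * w - (labels[viol] @ data[viol]) / k
        w -= eta * grad   # equals (1 - 1/t)*w + (eta/k) * sum(y_i * x_i)
    return w
```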

* * *
Expand Down
14 changes: 7 additions & 7 deletions blog/ml/2.k-近邻算法.md
@@ -98,7 +98,7 @@ knn 算法按照距离最近的三部电影的类型,决定未知电影的类

> Collect the data: a text file is provided
-Helen keeps the data on these dating candidates in the text file [datingTestSet2.txt](/db/2.KNN/datingTestSet2.txt), 1000 rows in total. Each of Helen's dating candidates is described by the following 3 features:
+Helen keeps the data on these dating candidates in the text file [datingTestSet2.txt](/data/2.KNN/datingTestSet2.txt), 1000 rows in total. Each of Helen's dating candidates is described by the following 3 features:

* Frequent-flyer miles earned per year
* Percentage of time spent playing video games
@@ -288,7 +288,7 @@ def datingClassTest():
# Fraction of the data used for testing (training fraction = 1 - hoRatio)
hoRatio = 0.1  # test range; one part for testing, the rest as training samples
# Load the data from the file
-datingDataMat, datingLabels = file2matrix('db/2.KNN/datingTestSet2.txt')  # load data set from file
+datingDataMat, datingLabels = file2matrix('data/2.KNN/datingTestSet2.txt')  # load data set from file
# Normalize the data
normMat, ranges, minVals = autoNorm(datingDataMat)
# m is the number of rows of the data, i.e. the first dimension of the matrix
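
The autoNorm helper called here sits outside this hunk; it is plain min-max scaling so that each feature lands in [0, 1]. A sketch assuming the (normMat, ranges, minVals) return convention used above:

```python
import numpy as np

def autoNorm(dataSet):
    # Rescale each column into [0, 1]: (x - min) / (max - min)
    minVals = dataSet.min(axis=0)
    ranges = dataSet.max(axis=0) - minVals
    normMat = (dataSet - minVals) / ranges
    return normMat, ranges, minVals
```
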
@@ -361,7 +361,7 @@ You will probably like this person: in small doses

> Collect the data: text files are provided
-The directory [trainingDigits](/db/2.KNN/trainingDigits) contains about 2000 examples, one of which is shown in the image below, with roughly 200 samples per digit; the directory [testDigits](/db/2.KNN/testDigits) contains about 900 test examples.
+The directory [trainingDigits](/data/2.KNN/trainingDigits) contains about 2000 examples, one of which is shown in the image below, with roughly 200 samples per digit; the directory [testDigits](/data/2.KNN/testDigits) contains about 900 test examples.

![An example from the handwritten digits dataset](/img/ml/2.KNN/knn_2_handWriting.png)

@@ -402,7 +402,7 @@ array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1
def handwritingClassTest():
# 1. Load the training data
hwLabels = []
-trainingFileList = listdir('db/2.KNN/trainingDigits')  # load the training set
+trainingFileList = listdir('data/2.KNN/trainingDigits')  # load the training set
m = len(trainingFileList)
trainingMat = zeros((m, 1024))
# hwLabels stores the digit labels 0-9 by index; each row of trainingMat holds the image vector for the corresponding file
@@ -412,17 +412,17 @@ def handwritingClassTest():
classNumStr = int(fileStr.split('_')[0])
hwLabels.append(classNumStr)
# Reshape the 32*32 matrix into a 1*1024 vector
-trainingMat[i, :] = img2vector('db/2.KNN/trainingDigits/%s' % fileNameStr)
+trainingMat[i, :] = img2vector('data/2.KNN/trainingDigits/%s' % fileNameStr)

# 2. Load the test data
-testFileList = listdir('db/2.KNN/testDigits')  # iterate through the test set
+testFileList = listdir('data/2.KNN/testDigits')  # iterate through the test set
errorCount = 0.0
mTest = len(testFileList)
for i in range(mTest):
fileNameStr = testFileList[i]
fileStr = fileNameStr.split('.')[0] # take off .txt
classNumStr = int(fileStr.split('_')[0])
-vectorUnderTest = img2vector('db/2.KNN/testDigits/%s' % fileNameStr)
+vectorUnderTest = img2vector('data/2.KNN/testDigits/%s' % fileNameStr)
classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
print "the classifier came back with: %d, the real answer is: %d" % (classifierResult, classNumStr)
if (classifierResult != classNumStr): errorCount += 1.0
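
This test leans on two helpers the hunk does not show: img2vector, which flattens a 32x32 character grid into a 1x1024 row, and classify0, the distance-voting kNN classifier. A compact sketch of both, assuming NumPy arrays and the signatures used above:

```python
import numpy as np

def img2vector(filename):
    # Flatten a 32x32 grid of '0'/'1' characters into a 1x1024 vector
    vec = np.zeros((1, 1024))
    with open(filename) as f:
        for i, line in enumerate(f.readlines()[:32]):
            for j in range(32):
                vec[0, 32 * i + j] = int(line[j])
    return vec

def classify0(inX, dataSet, labels, k):
    # Vote among the labels of the k nearest training points (Euclidean distance)
    dists = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    votes = {}
    for idx in dists.argsort()[:k]:
        votes[labels[idx]] = votes.get(labels[idx], 0) + 1
    return max(votes, key=votes.get)
```
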
4 changes: 2 additions & 2 deletions blog/ml/4.朴素贝叶斯.md
@@ -502,11 +502,11 @@ def spamTest():
fullText = []
for i in range(1, 26):
# Tokenize and parse the data, then label it class 1
-wordList = textParse(open('db/4.NaiveBayes/email/spam/%d.txt' % i).read())
+wordList = textParse(open('data/4.NaiveBayes/email/spam/%d.txt' % i).read())
docList.append(wordList)
classList.append(1)
# Tokenize and parse the data, then label it class 0
-wordList = textParse(open('db/4.NaiveBayes/email/ham/%d.txt' % i).read())
+wordList = textParse(open('data/4.NaiveBayes/email/ham/%d.txt' % i).read())
docList.append(wordList)
fullText.extend(wordList)
classList.append(0)
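
The textParse tokenizer used above is not part of this hunk; in the book it is a short regex split that lowercases and discards very short tokens. A sketch under that assumption:

```python
import re

def textParse(bigString):
    # Split on runs of non-word characters; keep lowercased tokens longer than 2 chars
    return [tok.lower() for tok in re.split(r'\W+', bigString) if len(tok) > 2]
```
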
6 changes: 3 additions & 3 deletions blog/ml/5.Logistic回归.md
@@ -297,7 +297,7 @@ def plotBestFit(dataArr, labelMat, weights):
```python
def testLR():
# 1. Collect and prepare the data
dataMat, labelMat = loadDataSet("db/5.Logistic/TestSet.txt")
dataMat, labelMat = loadDataSet("data/5.Logistic/TestSet.txt")

# print dataMat, '---\n', labelMat
# 2. Train the model: the coefficients (a1, b2, ..., nn).T in f(x) = a1*x1 + b2*x2 + ... + nn*xn
@@ -576,8 +576,8 @@ def colicTest():
Returns:
errorRate -- classification error rate
'''
-frTrain = open('db/5.Logistic/horseColicTraining.txt')
-frTest = open('db/5.Logistic/horseColicTest.txt')
+frTrain = open('data/5.Logistic/horseColicTraining.txt')
+frTest = open('data/5.Logistic/horseColicTest.txt')
trainingSet = []
trainingLabels = []
# Parse the features and labels of the training set
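
Once trained, colicTest scores each test row with classifyVector, which thresholds the sigmoid of the weighted feature sum at 0.5. A sketch of those two helpers, assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classifyVector(inX, weights):
    # Predict class 1 when P(y=1 | x) = sigmoid(w . x) exceeds 0.5
    return 1.0 if sigmoid(np.dot(inX, weights)) > 0.5 else 0.0
```
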
4 changes: 2 additions & 2 deletions blog/ml/6.支持向量机.md
@@ -526,7 +526,7 @@ def smoP(dataMatIn, classLabels, C, toler, maxIter, kTup=('lin', 0)):
def testDigits(kTup=('rbf', 10)):

# 1. Load the training data
-dataArr, labelArr = loadImages('db/6.SVM/trainingDigits')
+dataArr, labelArr = loadImages('data/6.SVM/trainingDigits')
b, alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, kTup)
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
@@ -544,7 +544,7 @@ def testDigits(kTup=('rbf', 10)):
print("the training error rate is: %f" % (float(errorCount) / m))

# 2. Load the test data
-dataArr, labelArr = loadImages('db/6.SVM/testDigits')
+dataArr, labelArr = loadImages('data/6.SVM/testDigits')
errorCount = 0
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
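
The kTup=('rbf', 10) argument selects the Gaussian kernel inside the book's kernelTrans helper, i.e. K(x, x_i) = exp(-||x - x_i||^2 / sigma^2) with sigma = 10. A sketch of that kernel mapping, a hypothetical simplification rather than the repository's exact code:

```python
import numpy as np

def kernelTrans(X, A, kTup):
    # Evaluate the kernel between every row of X and the single row A
    if kTup[0] == 'lin':
        return X @ A
    if kTup[0] == 'rbf':
        diff = X - A
        return np.exp(-(diff ** 2).sum(axis=1) / kTup[1] ** 2)
    raise NameError('unrecognized kernel: %s' % kTup[0])
```
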
4 changes: 2 additions & 2 deletions blog/ml/7.集成方法-随机森林和AdaBoost.md
@@ -498,13 +498,13 @@ def adaClassify(datToClass, classifierArr):
```python
# Horse colic dataset
# Training set
dataArr, labelArr = loadDataSet("db/7.AdaBoost/horseColicTraining2.txt")
dataArr, labelArr = loadDataSet("data/7.AdaBoost/horseColicTraining2.txt")
weakClassArr, aggClassEst = adaBoostTrainDS(dataArr, labelArr, 40)
print weakClassArr, '\n-----\n', aggClassEst.T
# Compute the AUC (the area under the ROC curve)
plotROC(aggClassEst.T, labelArr)
# Test set
dataArrTest, labelArrTest = loadDataSet("db/7.AdaBoost/horseColicTest2.txt")
dataArrTest, labelArrTest = loadDataSet("data/7.AdaBoost/horseColicTest2.txt")
m = shape(dataArrTest)[0]
predicting10 = adaClassify(dataArrTest, weakClassArr)
errArr = mat(ones((m, 1)))
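
plotROC builds the ROC curve by ranking aggClassEst and summing the vertical steps; the AUC it prints can be computed the same way. A standalone sketch, assuming +1/-1 labels and real-valued scores:

```python
import numpy as np

def auc_score(scores, labels):
    # Walk examples from highest score down: positives move the ROC curve
    # up, negatives move it right and add a strip of area under it.
    scores = np.asarray(scores).ravel()
    labels = np.asarray(labels).ravel()
    n_pos = (labels == 1.0).sum()
    n_neg = len(labels) - n_pos
    y = area = 0.0
    for idx in np.argsort(scores)[::-1]:
        if labels[idx] == 1.0:
            y += 1.0 / n_pos
        else:
            area += y / n_neg
    return area
```
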
22 changes: 11 additions & 11 deletions blog/ml/8.回归.md
@@ -161,7 +161,7 @@ def standRegres(xArr,yArr):


def regression1():
xArr, yArr = loadDataSet("db/8.Regression/data.txt")
xArr, yArr = loadDataSet("data/8.Regression/data.txt")
xMat = mat(xArr)
yMat = mat(yArr)
ws = standRegres(xArr, yArr)
@@ -325,7 +325,7 @@ def lwlrTestPlot(xArr,yArr,k=1.0):

#test for LWLR
def regression2():
xArr, yArr = loadDataSet("db/8.Regression/data.txt")
xArr, yArr = loadDataSet("data/8.Regression/data.txt")
yHat = lwlrTest(xArr, xArr, yArr, 0.003)
xMat = mat(xArr)
srtInd = xMat[:,1].argsort(0) # argsort() sorts the elements of x in ascending order and returns their indices
@@ -418,7 +418,7 @@ def abaloneTest():
None
'''
# Load the data
abX, abY = loadDataSet("db/8.Regression/abalone.txt")
abX, abY = loadDataSet("data/8.Regression/abalone.txt")
# Predict with kernels of different bandwidths
oldyHat01 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 0.1)
oldyHat1 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 1)
@@ -540,7 +540,7 @@ def ridgeTest(xArr,yArr):

#test for ridgeRegression
def regression3():
abX,abY = loadDataSet("db/8.Regression/abalone.txt")
abX,abY = loadDataSet("data/8.Regression/abalone.txt")
ridgeWeights = ridgeTest(abX, abY)
fig = plt.figure()
ax = fig.add_subplot(111)
@@ -619,7 +619,7 @@ def stageWise(xArr,yArr,eps=0.01,numIt=100):

#test for stageWise
def regression4():
xArr,yArr=loadDataSet("db/8.Regression/abalone.txt")
xArr,yArr=loadDataSet("data/8.Regression/abalone.txt")
print(stageWise(xArr,yArr,0.01,200))
xMat = mat(xArr)
yMat = mat(yArr).T
@@ -745,12 +745,12 @@ def scrapePage(retX, retY, inFile, yr, numPce, origPrc):

# Read the six LEGO set pages in turn and build the data matrix
def setDataCollect(retX, retY):
-scrapePage(retX, retY, 'db/8.Regression/setHtml/lego8288.html', 2006, 800, 49.99)
-scrapePage(retX, retY, 'db/8.Regression/setHtml/lego10030.html', 2002, 3096, 269.99)
-scrapePage(retX, retY, 'db/8.Regression/setHtml/lego10179.html', 2007, 5195, 499.99)
-scrapePage(retX, retY, 'db/8.Regression/setHtml/lego10181.html', 2007, 3428, 199.99)
-scrapePage(retX, retY, 'db/8.Regression/setHtml/lego10189.html', 2008, 5922, 299.99)
-scrapePage(retX, retY, 'db/8.Regression/setHtml/lego10196.html', 2009, 3263, 249.99)
+scrapePage(retX, retY, 'data/8.Regression/setHtml/lego8288.html', 2006, 800, 49.99)
+scrapePage(retX, retY, 'data/8.Regression/setHtml/lego10030.html', 2002, 3096, 269.99)
+scrapePage(retX, retY, 'data/8.Regression/setHtml/lego10179.html', 2007, 5195, 499.99)
+scrapePage(retX, retY, 'data/8.Regression/setHtml/lego10181.html', 2007, 3428, 199.99)
+scrapePage(retX, retY, 'data/8.Regression/setHtml/lego10189.html', 2008, 5922, 299.99)
+scrapePage(retX, retY, 'data/8.Regression/setHtml/lego10196.html', 2009, 3263, 249.99)
```

> Test the algorithm: use cross-validation to compare the different models and see which performs best
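
The core solver behind ridgeTest is the closed-form ridge estimate w = (X^T X + lam*I)^(-1) X^T y, swept over a log-spaced grid of lam values; cross-validation then picks the lam with the lowest held-out error. A sketch of that core step, assuming standardized inputs:

```python
import numpy as np

def ridgeRegres(xMat, yMat, lam=0.2):
    # Closed-form ridge solution: w = (X^T X + lam*I)^(-1) X^T y
    xTx = xMat.T @ xMat
    denom = xTx + lam * np.eye(xMat.shape[1])
    if np.linalg.det(denom) == 0.0:
        raise np.linalg.LinAlgError('singular matrix even with ridge term')
    return np.linalg.solve(denom, xMat.T @ yMat)
```
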
2 changes: 1 addition & 1 deletion docs/ml/13.PCA/index.html
@@ -798,7 +798,7 @@ <h4 id="_4">开发流程</h4>
<p>Prepare the data: replace NaN values with the mean</p>
</blockquote>
<pre><code class="python">def replaceNanWithMean():
-datMat = loadDataSet('db/13.PCA/secom.data', ' ')
+datMat = loadDataSet('data/13.PCA/secom.data', ' ')
numFeat = shape(datMat)[1]
for i in range(numFeat):
# Take the mean over the values that are not NaN
2 changes: 1 addition & 1 deletion docs/ml/14.SVD/index.html
@@ -1171,7 +1171,7 @@ <h3 id="svd_5">项目案例: 基于 SVD 的图像压缩</h3>
thresh: decision threshold
&quot;&quot;&quot;
# Build a list
-myMat = imgLoadData('db/14.SVD/0_5.txt')
+myMat = imgLoadData('data/14.SVD/0_5.txt')

print &quot;****original matrix****&quot;
# Run SVD on the original image and reconstruct it
16 changes: 8 additions & 8 deletions docs/ml/15.BigData_MapReduce/index.html
@@ -787,15 +787,15 @@ <h3 id="hadoop-python">Hadoop 流(Python 调用)</h3>
</blockquote>
<pre><code># Test the Mapper
# Linux
-cat db/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanMapper.py
+cat data/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanMapper.py
# Windows
-# python src/python/15.BigData_MapReduce/mrMeanMapper.py &lt; db/15.BigData_MapReduce/inputFile.txt
+# python src/python/15.BigData_MapReduce/mrMeanMapper.py &lt; data/15.BigData_MapReduce/inputFile.txt

# Test the Reducer
# Linux
-cat db/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanMapper.py | python src/python/15.BigData_MapReduce/mrMeanReducer.py
+cat data/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanMapper.py | python src/python/15.BigData_MapReduce/mrMeanReducer.py
# Windows
-# python src/python/15.BigData_MapReduce/mrMeanMapper.py &lt; db/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanReducer.py
+# python src/python/15.BigData_MapReduce/mrMeanMapper.py &lt; data/15.BigData_MapReduce/inputFile.txt | python src/python/15.BigData_MapReduce/mrMeanReducer.py
</code></pre>

<h3 id="mapreduce">MapReduce 机器学习</h3>
@@ -815,17 +815,17 @@ <h3 id="mrjob-mapreduce">使用 mrjob 库将 MapReduce 自动化</h3>
<li>Frameworks that automate MapReduce job flows: Cascading and Oozie.</li>
<li>mrjob is a nice learning tool; it was open-sourced at the end of 2010 and comes from Yelp (a restaurant review site).</li>
</ul>
<pre><code class="Shell">python src/python/15.BigData_MapReduce/mrMean.py &lt; db/15.BigData_MapReduce/inputFile.txt &gt; db/15.BigData_MapReduce/myOut.txt
<pre><code class="Shell">python src/python/15.BigData_MapReduce/mrMean.py &lt; data/15.BigData_MapReduce/inputFile.txt &gt; data/15.BigData_MapReduce/myOut.txt
</code></pre>

<blockquote>
<p>Hands-on script</p>
</blockquote>
<pre><code># Test the mrjob example
# First, test just the mapper
-# python src/python/15.BigData_MapReduce/mrMean.py --mapper &lt; db/15.BigData_MapReduce/inputFile.txt
+# python src/python/15.BigData_MapReduce/mrMean.py --mapper &lt; data/15.BigData_MapReduce/inputFile.txt
# To run the whole job, just remove --mapper
-python src/python/15.BigData_MapReduce/mrMean.py &lt; db/15.BigData_MapReduce/inputFile.txt
+python src/python/15.BigData_MapReduce/mrMean.py &lt; data/15.BigData_MapReduce/inputFile.txt
</code></pre>

<h3 id="svm-pegasos">项目案例:分布式 SVM 的 Pegasos 算法</h3>
@@ -928,7 +928,7 @@ <h4 id="_3">开发流程</h4>
</code></pre>

<p><a href="https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/pegasos.py">Full code</a>: <a href="https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/pegasos.py">https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/pegasos.py</a></p>
-<p>How to run: <code>python /opt/git/MachineLearning/src/python/15.BigData_MapReduce/mrSVM.py &lt; db/15.BigData_MapReduce/inputFile.txt</code>
+<p>How to run: <code>python /opt/git/MachineLearning/src/python/15.BigData_MapReduce/mrSVM.py &lt; data/15.BigData_MapReduce/inputFile.txt</code>
<a href="https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/mrSVM.py">MapReduce version of the code</a>: <a href="https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/mrSVM.py">https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/15.BigData_MapReduce/mrSVM.py</a></p>
<hr />
<ul>
Expand Down
