Prochain Science: Adding Your Own Training Data to MNIST - PNG Conversion, Packaging, and Merging Two Datasets for Training

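Handwriting recognition has become the Hello World of ML, but MNIST digits clearly fall short when you try to recognize digits written on paper. One likely reason is that country and language influence how digits are written and styled. The MNIST database is large, yet a weight model built from only some 60,000 training images is unlikely to recognize the digits in such images accurately. I approached the problem from the data side: I extracted the images that were misrecognized, rebuilt them as MNIST-format data, and retrained on them, hoping to improve the recognition rate.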

This post records how to convert those images back into trainable MNIST files (idx3-ubyte, idx1-ubyte) and train a model on the merged data.

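The conclusion first: after a PoC, I found that this kind of Retraining Correction had no real effect. For now I can only suspect that the parameters still need tuning; put another way, the Perceptron weight given to the added training images probably has to be adjusted before the result can improve.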

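How to extract the data

Note: the final image should be square. It is eventually shrunk to 28*28, and a non-uniform resize (that is, source data not made at a 1:1 aspect ratio) is likely to distort it.

This short section does not fit every situation; it simply records how I extracted my own data.

Suppose you are reading from an existing handwriting image and want to outline the handwritten digits with a mouse-drawn polygon. The approach: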

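Plot points on the pygame surface: each time pygame.draw.circle is called, append the position to a list of coordinates.
Set up a drag event: on MOUSEBUTTONDOWN set is_drag = True, and on MOUSEBUTTONUP set is_drag = False.
Record the coordinates on every drag. The drag is detected together with the mouse-motion event:

is_drag and e.type == pygame.MOUSEMOTION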
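The drag logic above can be sketched as a small state machine. This is an illustration only, not the author's code: the class name DragTracker and its handle method are hypothetical, and the event-type constants are passed in so the sketch has no pygame dependency. With pygame you would pass pygame.MOUSEBUTTONDOWN, pygame.MOUSEBUTTONUP and pygame.MOUSEMOTION.

```python
class DragTracker:
    """Collects mouse positions while a button is held down (hypothetical helper)."""

    def __init__(self, down_type, up_type, motion_type):
        self.down_type = down_type
        self.up_type = up_type
        self.motion_type = motion_type
        self.is_drag = False
        self.drag_track = []  # list of (x, y) positions of the current outline

    def handle(self, event_type, pos):
        """Update the drag state for one event; return True if a point was recorded."""
        if event_type == self.down_type:
            self.is_drag = True
            self.drag_track = [pos]  # a new press starts a new outline
            return True
        if event_type == self.up_type:
            self.is_drag = False
            return False
        if event_type == self.motion_type and self.is_drag:
            self.drag_track.append(pos)
            return True
        return False
```

In a pygame loop, each mouse event e from pygame.event.get() would be fed in as tracker.handle(e.type, e.pos), and a circle drawn at e.pos whenever the call returns True.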

Take the recorded coordinates and capture that part of the screen:

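import cv2
import numpy as np
import pygame

# Crop the image by the recorded outline.
# cv2.boundingRect and cv2.drawContours expect int32 points.
pts = np.array(drag_track, dtype=np.int32)
if len(pts) < 10:
    continue  # too few points for an outline; this snippet runs inside the event loop

# Save the screen image (crope is not defined in this snippet)
pygame.image.save(crope(screen), "tmp.png")

# Read it back with OpenCV
img = cv2.imread("tmp.png")

## (1) crop the bounding rectangle
rect = cv2.boundingRect(pts)
x, y, w, h = rect
croped = img[y:y+h, x:x+w].copy()  # croped is the cropped image

## (2) make the mask, with the points shifted to the crop's origin
pts = pts - pts.min(axis=0)

mask = np.zeros(croped.shape[:2], np.uint8)
cv2.drawContours(mask, [pts], -1, (255, 255, 255), -1, cv2.LINE_AA)

## (3) keep only the pixels inside the outline
dst = cv2.bitwise_and(croped, croped, mask=mask)

## (4) add the white background outside the outline
bg = np.ones_like(croped, np.uint8) * 255
cv2.bitwise_not(bg, bg, mask=mask)
dst2 = bg + dst

# Invert the colors (MNIST digits are light on a dark background)
dst2 = 255 - dst2

croped_image = dst2  # final cropped image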

Finally, croped_image is used as the extracted data.
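Because the bounding rectangle is rarely square, one way to follow the 1:1 advice above is to pad the crop onto a square canvas before resizing. The following is a minimal sketch under that assumption; pad_to_square is a hypothetical helper, not part of the original notebook, and the default fill of 0 assumes the image has already been inverted to a dark background.

```python
import numpy as np

def pad_to_square(img, fill=0):
    """Center a 2-D or 3-D image array on a square canvas filled with `fill`."""
    h, w = img.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    left = (side - w) // 2
    # Keep any trailing channel axis so color and grayscale inputs both work.
    canvas = np.full((side, side) + img.shape[2:], fill, dtype=img.dtype)
    canvas[top:top + h, left:left + w] = img
    return canvas
```

The square result can then be resized to 28*28 with cv2.resize without changing the aspect ratio of the digit.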

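Converting the PNG data to grayscale

An ordinary PNG, even after it has been saved, is still in RGB or RGBA mode and cannot be converted directly into MNIST (idx-ubyte) data. It has to be grayscale first (MNIST is grayscale, not binary).

Use the PIL library to convert the image: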

from PIL import Image

# If you would rather use a binary image, use OpenCV instead:
#croped_image = cv2.resize(croped_image, (28, 28),  interpolation=cv2.INTER_CUBIC)
#ret,croped_image = cv2.threshold(croped_image,190,255,cv2.THRESH_BINARY)

# Convert the image to 8-bit grayscale (mode 'L'), not binary
im = Image.open("tmp.png")
im = im.convert('L')
im.save(fname.name)  # fname: the output file object, defined outside this snippet
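To make the difference between grayscale and binary concrete, here is a small NumPy sketch. It approximates the luma weighting that PIL documents for mode 'L' (299/1000 R + 587/1000 G + 114/1000 B) and is not guaranteed to match PIL bit for bit. The function names are hypothetical, the input is assumed to be in RGB channel order (OpenCV arrays are BGR), and the threshold of 190 is taken from the commented-out OpenCV line above.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an (H, W, 3) uint8 RGB array to an (H, W) uint8 grayscale array."""
    rgb = rgb.astype(np.float64)
    gray = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)

def binarize(gray, threshold=190):
    """Threshold a grayscale array to 0 or 255, shown only for comparison."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)
```

Grayscale keeps the intermediate intensities along the edges of a stroke, while the binary version collapses every pixel to one of two values.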

Building the idx-ubyte data with a tool

To build the idx-ubyte train and test packages you can use the JPG-PNG-to-MNIST-NN-Format tool. After downloading it from GitHub, you may need to edit convert-images-to-mnist-format.py on Windows; lines 15 to 25 need to be changed:

for dirname in os.listdir(name[0])[0:]: # windows
 path = os.path.join(name[0],dirname)
 for filename in os.listdir(path):
  if filename.endswith(".png"):
   FileList.append(os.path.join(name[0],dirname,filename))

shuffle(FileList) # Usefull for further segmenting the validation set

for filename in FileList:
 print(filename)
 label = int(filename.split('\\')[1])# windows

Note that the two lines marked with the windows comment are the ones that may need changing on Windows.
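Splitting on a backslash ties the label lookup to Windows paths. As a platform-independent alternative, the label can be read from the parent folder name with pathlib. This is a sketch under the folder-per-class layout this post uses, not code from the tool itself; label_from_path is a hypothetical name.

```python
from pathlib import PurePath

def label_from_path(path):
    """Return the class label encoded in the parent folder name, e.g. 8."""
    p = path if isinstance(path, PurePath) else PurePath(path)
    return int(p.parent.name)
```

It raises ValueError if the parent folder name is not an integer, which also catches files placed in the wrong folder.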

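Next, put the converted PNG files under the test-images or train-images folder, inside a subfolder named after the class label. For example, PNG images of class 8 go to ./train-images/8/xxx.png

Then run convert-images-to-mnist-format.py directly to convert them.

This produces the .gz files (train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz and t10k-labels-idx1-ubyte.gz).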
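For reference, the idx-ubyte layout itself is simple: a big-endian header (magic number 2051 for images or 2049 for labels, followed by the dimension sizes) and then the raw unsigned bytes. The sketch below shows that layout using only the standard library. It is not the tool's code, and the function names are hypothetical.

```python
import struct

def idx_images_bytes(images):
    """Serialize a list of equally sized 2-D images (rows of 0-255 ints) as idx3-ubyte."""
    rows, cols = len(images[0]), len(images[0][0])
    # Header: magic 0x00000803, image count, rows, cols (all big-endian uint32).
    header = struct.pack(">IIII", 2051, len(images), rows, cols)
    body = bytes(px for img in images for row in img for px in row)
    return header + body

def idx_labels_bytes(labels):
    """Serialize a list of integer labels (0-255) as idx1-ubyte."""
    # Header: magic 0x00000801 and label count (big-endian uint32).
    return struct.pack(">II", 2049, len(labels)) + bytes(labels)
```

Writing either payload through gzip.open(path, "wb") gives a compressed file in the same format.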

Merging with the MNIST dataset

To merge with the MNIST dataset, first load the .gz files produced in the previous section, then load the original MNIST data:

import numpy as np
from tensorflow import keras
# Assumption: extract_images / extract_labels are the TF 1.x MNIST helpers
from tensorflow.contrib.learn.python.learn.datasets.mnist import extract_images, extract_labels

# Load the self-made dataset, called "new"
with open('./newDataset/train-images-idx3-ubyte.gz', 'rb') as f:
  train_images_new = extract_images(f)
  train_images_new = train_images_new.reshape(train_images_new.shape[0], 28, 28, 1)
with open('./newDataset/train-labels-idx1-ubyte.gz', 'rb') as f:
  train_labels_new = extract_labels(f)

with open('./newDataset/t10k-images-idx3-ubyte.gz', 'rb') as f:
  test_images_new = extract_images(f)
  test_images_new = test_images_new.reshape(test_images_new.shape[0], 28, 28, 1)
with open('./newDataset/t10k-labels-idx1-ubyte.gz', 'rb') as f:
  test_labels_new = extract_labels(f)

# Load the original MNIST data, called "old"
(train_images_old, train_labels_old), (test_images_old, test_labels_old) = keras.datasets.mnist.load_data()

#https://stackoverflow.com/questions/43153076/how-to-concatenate-numpy-arrays-into-a-specific-shape
# Note: both sets must have the same dimensions, e.g. 28*28
train_images_old = train_images_old.reshape(train_images_old.shape[0], 28, 28, 1)
test_images_old = test_images_old.reshape(test_images_old.shape[0], 28, 28, 1)

train_images, train_labels, test_images, test_labels = [None, None, None, None]

if type(train_images_old) == type(train_images_new):
    print("Type is same, concat/merge datasets...")
    # The new samples come first, followed by the original MNIST samples
    train_images = np.concatenate((train_images_new, train_images_old))
    train_labels = np.concatenate((train_labels_new, train_labels_old))

    test_images = np.concatenate((test_images_new, test_images_old))
    test_labels = np.concatenate((test_labels_new, test_labels_old))
    print("Merge succeeded.")

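That completes the dataset merge. The four variables in the code above, train_images, train_labels, test_images and test_labels, use nearly the same names as most common MNIST training programs. If your program calls them X and Y instead, train_images is X and train_labels is Y.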
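np.concatenate simply appends one set after the other, so the samples stay grouped by source. A possible refinement, sketched here as an assumption rather than something the original notebook does, is to check the shapes first and then shuffle images and labels with a single shared permutation. merge_datasets is a hypothetical helper.

```python
import numpy as np

def merge_datasets(images_a, labels_a, images_b, labels_b, seed=0):
    """Concatenate two image/label sets and shuffle them with one permutation."""
    if images_a.shape[1:] != images_b.shape[1:]:
        raise ValueError("image shapes differ: %s vs %s"
                         % (images_a.shape[1:], images_b.shape[1:]))
    if len(images_a) != len(labels_a) or len(images_b) != len(labels_b):
        raise ValueError("each image set needs exactly one label per image")
    images = np.concatenate((images_a, images_b))
    labels = np.concatenate((labels_a, labels_b))
    # One permutation for both arrays keeps every image paired with its label.
    order = np.random.default_rng(seed).permutation(len(images))
    return images[order], labels[order]
```

With the variables from this section, the call would be merge_datasets(train_images_new, train_labels_new, train_images_old, train_labels_old).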

For the full ipynb, see: https://gist.github.com/hpcslag/6d92d9e52b02def8025afc11d5850e07

Reference:

https://blog.csdn.net/icamera0/article/details/50843172

https://stackoverflow.com/questions/43153076/how-to-concatenate-numpy-arrays-into-a-specific-shape
https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html

https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
https://stackoverflow.com/questions/53253465/how-to-combine-keras-mnist-dataset-with-my-own-mnist-images
https://towardsdatascience.com/how-to-use-dataset-in-tensorflow-c758ef9e4428
https://stackoverflow.com/questions/48771502/is-there-a-way-to-stack-two-tensorflow-datasets
https://www.tensorflow.org/api_docs/python/tf/data/Dataset
https://github.com/tensorflow/tensorflow/blob/r1.14/tensorflow/python/data/ops/dataset_ops.py#L1716-L1718
https://stackoverflow.com/questions/49981542/tensorflow-concat-tf-data-dataset-batches

https://stackoverflow.com/questions/45979848/merge-2-sequential-models-in-keras
https://github.com/keras-team/keras/issues/9969
https://github.com/tensorflow/tensorflow/issues/17471
https://github.com/tensorflow/tensorflow/issues/17364
https://blog.csdn.net/u010874976/article/details/78571788
http://rasbt.github.io/mlxtend/user_guide/data/loadlocal_mnist/
https://discuss.pytorch.org/t/combine-train-and-test-data/24004


Thursday, August 1, 2019
