Playing with Data and Machine Learning: Natural Language Processing as an Example (iThome Ironman Contest)

Posted on 2018-04-13 | Category: python |

Introduction

Github: https://github.com/GoatWang/ithome_ironman
Reader Feedback:
Read more »

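Web Crawling with Multithreading Combined with Asynchronous Techniques

Posted on 2017-08-22 | Category: Python Note |

Overview

Before reading this article, I strongly recommend getting familiar with each technique on its own: asynchronous crawling and multithreaded crawling. Below, I briefly cover two points: how the two differ, and what to focus on when trying to understand multithreading.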
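As a rough illustration of what combining the two can look like, here is a minimal sketch that assumes one common pattern: each worker thread runs its own `asyncio` event loop over a batch of URLs. This is not necessarily the approach the full article takes. The `fetch` coroutine, the URLs, and the batch split are hypothetical, and `asyncio.sleep` stands in for real network latency.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch(url):
    # Hypothetical stand-in for an HTTP request: sleep instead of real I/O.
    await asyncio.sleep(0.05)
    return url, len(url)


async def crawl_batch(urls):
    # Inside one thread, all requests of the batch wait concurrently.
    return await asyncio.gather(*(fetch(u) for u in urls))


def run_batch(urls):
    # Each worker thread starts (and closes) its own event loop.
    return asyncio.run(crawl_batch(urls))


urls = [f"https://example.com/page/{i}" for i in range(8)]
batches = [urls[i::2] for i in range(2)]  # split the URLs across two threads

with ThreadPoolExecutor(max_workers=2) as pool:
    results = [item for batch in pool.map(run_batch, batches) for item in batch]
```

Threads give parallel batches, while the event loop inside each thread overlaps the waiting time of the requests in that batch.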

Read more »

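Asynchronous Web Crawling

Posted on 2017-07-29 | Category: Python Note |

Overview

When fetching web pages, every request to a server for an HTML file means waiting some time for the response. During that wait the client (that is, your computer) is not actually computing anything, so if you use that time to send other requests, you can greatly increase crawling speed. The concept of asynchronous programming is quite complex, though, and even touches on some hardware knowledge. The main difficulty is explaining how it differs from multithreading, which I will not go into here; if you are interested, you can look it up on Google.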
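The idea of using the waiting time to send other requests can be sketched with Python's standard `asyncio` module. This is a minimal simulation, not the code from the full article: `fetch` is a hypothetical coroutine, the URLs are placeholders, and `asyncio.sleep` stands in for the time spent waiting on the server.

```python
import asyncio
import time


async def fetch(url, delay=0.1):
    # Hypothetical request: while this coroutine "waits for the server",
    # the event loop is free to start the other requests.
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"


async def crawl(urls):
    # gather() runs all fetches concurrently and keeps the input order.
    return await asyncio.gather(*(fetch(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]

start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start
print(f"fetched {len(pages)} pages in {elapsed:.2f}s")
```

Fetching the five pages one after another would take about five times the single delay; here the waits overlap, so the total stays close to one delay.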

Read more »

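A Python Web Crawler Learning Roadmap for Beginners

Posted on 2017-07-29 | Category: Python Note |

Goal

Give an overview of the packages I have worked with, so that readers get a basic understanding of the crawler "technology chain" and of "common problems and how to solve them".

Motivation

This article is written mainly for beginners who are just starting to learn web crawling in Python. When I first started learning this material, package names came flying at me like snowflakes. Sometimes I misunderstood how a package was meant to be used; other times I expected too much from a package and, once I had learned it, found it was nothing special and felt let down. Hence this write-up.

Read more »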

Use the Facebook API to Log In with ASP.NET Identity

Posted on 2017-06-25 | Category: C# |

Introduction

Because Facebook changed the query string its API expects, the built-in Facebook connector in ASP.NET Identity no longer works correctly. With Google authentication, you only need to supply the ClientId and ClientSecret and set up the app on the Google side to connect to the Google API. For Facebook authentication, however, you have to redefine the query string yourself.

Read more »

Back Up mLab MongoDB to Local Periodically with C#

Posted on 2017-06-24 | Category: C# |

Introduction

At first, I tried to use mLab's built-in backup system. However, it is not included in the free 500 MB plan, so I wrote a C# program to do the backups myself.

Read more »

Train Wiki Corpus by gensim Word2vec

Posted on 2017-06-06 | Category: Python Note |

What is Word2vec?

A few keywords to keep in mind about the model: unsupervised learning, a shallow neural network, and encoding (translating) a word into a vector. To be honest, I do not know the theory behind word2vec in detail. What I can tell you is that, with a model trained on a corpus by this method, you can find the words related to a specific word or to a list of words. The best-known property is that the vectors this model produces for related words support arithmetic. For example, (the vector of) king - man ≈ queen - woman. In the same way, you can approximate Taiwan's vector by France - Paris + Taipei.
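To make the vector arithmetic concrete, here is a toy sketch in plain Python. The 3-dimensional vectors are hand-made for illustration and are not real word2vec output; the `analogy` helper is hypothetical and only mimics the usual "add, subtract, then pick the nearest remaining word by cosine similarity" procedure.

```python
import math

# Hand-made toy vectors (NOT trained embeddings). The axes loosely mean:
# (is a country, related to France, related to Taiwan).
vectors = {
    "France": [1.0, 1.0, 0.0],
    "Paris": [0.0, 1.0, 0.0],
    "Taiwan": [1.0, 0.0, 1.0],
    "Taipei": [0.0, 0.0, 1.0],
    "Japan": [1.0, 0.0, 0.2],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def analogy(positive, negative):
    # Build the query vector: sum of positives minus sum of negatives.
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + v for q, v in zip(query, vectors[word])]
    for word in negative:
        query = [q - v for q, v in zip(query, vectors[word])]
    # Rank every word that was not part of the query by cosine similarity.
    used = set(positive) | set(negative)
    scores = {w: cosine(query, v) for w, v in vectors.items() if w not in used}
    return max(scores, key=scores.get)


result = analogy(positive=["France", "Taipei"], negative=["Paris"])
print(result)
```

With a model actually trained by gensim, the corresponding query would go through `most_similar(positive=[...], negative=[...])` on the model's word vectors, and the result would only be approximately, not exactly, the expected word.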

Read more »

LogLoss Function

Posted on 2017-06-02 | Category: Python Note |

Function

From scikit-learn:
This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. The log loss is only defined for two or more labels. For a single sample with true label yt in {0,1} and estimated probability yp that yt = 1, the log loss is
$$-\log P(y_t \mid y_p) = -\bigl(y_t \log(y_p) + (1 - y_t)\log(1 - y_p)\bigr)$$
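The single-sample formula above can be written out directly in Python. This is a minimal sketch rather than scikit-learn's implementation: the function name `binary_log_loss` and the `eps` clipping value are assumptions, added so that probabilities of exactly 0 or 1 do not hit `log(0)`.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    # Clip the probability away from 0 and 1 (eps is an assumed value)
    # so that log() is always defined.
    p = min(max(y_prob, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


confident_right = binary_log_loss(1, 0.9)  # true label 1, predicted 0.9
confident_wrong = binary_log_loss(1, 0.1)  # true label 1, predicted 0.1
print(confident_right, confident_wrong)
```

A confident wrong prediction is penalized much more heavily than a confident right one, which is the behaviour the negative log-likelihood is meant to capture.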

Read more »

Deploy Indri on Windows 10 using Visual Studio

Posted on 2017-05-17 | Category: Information Retrieval |

Introduction

Indri is a powerful IR tool. For more information, you can go to their home page. Although they state that it can be set up on Windows, doing so is quite hard work. As a result, I want to write down the process of setting it up on Windows.

Read more »

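IR2: Evaluating Information Retrieval

Posted on 2017-04-22 | Category: Information Retrieval |

1. Preface

When evaluating information retrieval, people care about metrics along many dimensions. In the past, the important ones were things like the number of results and the speed of search. As technology has advanced, the focus has shifted toward accuracy along different dimensions, which is the subject of this chapter. It is worth noting that user interface (UI) and user experience (UX) also remain active research topics in this field: for example, Google's top-10 result method returns only the ten most relevant results per page, and Searchme Visual Search offers a novel presentation in which search results can be previewed. This kind of information is very hard to study quantitatively, however, so this chapter focuses mainly on how accuracy is measured and presented.

Read more »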

Goat Wang

a hardworking beginner of programming

GitHub Linkedin Facebook