astrobase.varclass.rfclass module
Does variable classification using random forests. Two types of classification are supported:
- Variable classification using non-periodic features: this is used to perform a binary classification between non-variable and variable. Uses the features in astrobase.varclass.varfeatures and astrobase.varclass.starfeatures.
- TODO: Periodic variable classification using periodic features: this is used to perform multi-class classification for periodic variables using the features in astrobase.varclass.periodicfeatures and astrobase.varclass.starfeatures. The classes recognized are listed in PERIODIC_VARCLASSES below and were generated from manual classification run on various HATNet, HATSouth and HATPI fields.
-
astrobase.varclass.rfclass.collect_nonperiodic_features(featuresdir, magcol, outfile, pklglob='varfeatures-*.pkl', featurestouse=['stetsonj', 'stetsonk', 'amplitude', 'magnitude_ratio', 'linear_fit_slope', 'eta_normal', 'percentile_difference_flux_percentile', 'mad', 'skew', 'kurtosis', 'mag_iqr', 'beyond1std', 'grcolor', 'gicolor', 'ricolor', 'bvcolor', 'jhcolor', 'jkcolor', 'hkcolor', 'gkcolor', 'propermotion'], maxobjects=None, labeldict=None, labeltype='binary')
This collects variability features into arrays for use with the classifier.
Parameters: - featuresdir (str) – This is the directory where all the varfeatures pickles are. Use pklglob to specify the glob to search for. The varfeatures pickles contain objectids, a light curve magcol, and features as dict key-vals. The astrobase.lcproc.lcvfeatures module can be used to produce these.
- magcol (str) – This is the key in each varfeatures pickle corresponding to the magcol of the light curve the variability features were extracted from.
- outfile (str) – This is the filename of the output pickle that will be written containing a dict of all the features extracted into np.arrays.
- pklglob (str) – This is the UNIX file glob to use to search for varfeatures pickle files in featuresdir.
- featurestouse (list of str) – Each varfeatures pickle can contain any combination of non-periodic, stellar, and periodic features; these must have the same names as elements in the list of strings provided in featurestouse. This tries to get all the features listed in NONPERIODIC_FEATURES_TO_COLLECT by default. If featurestouse is provided as a list, gets only the features listed in this kwarg instead.
- maxobjects (int or None) – This controls how many pickles from the featuresdir to process. If None, will process all varfeatures pickles.
- labeldict (dict or None) – If this is provided, it must be a dict with the following key:val list:
'<objectid>':<label value>
for each objectid collected from the varfeatures pickles. This will turn the collected information into a training set for classifiers.
Example: to carry out non-periodic variable feature collection of fake LCs prepared by astrobase.fakelcs.generation, use the value of the ‘isvariable’ dict elem from the fakelcs-info.pkl here, like so:
labeldict={x:y for x,y in zip(fakelcinfo['objectid'], fakelcinfo['isvariable'])}
- labeltype ({'binary', 'classes'}) – This is either ‘binary’ or ‘classes’ for binary/multi-class classification respectively.
Returns: This returns a dict with all of the features collected into np.arrays, ready to use as input to a scikit-learn classifier.
Return type: dict
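As a minimal sketch of the labeldict construction described above: the fakelcinfo dict here is a hypothetical stand-in with made-up objectids, not the real contents of a fakelcs-info.pkl, which may hold more keys. Only the 'objectid' and 'isvariable' keys come from the text above.

```python
# Hypothetical stand-in for the dict loaded from fakelcs-info.pkl; the
# objectids and labels below are invented for illustration only.
fakelcinfo = {
    'objectid': ['obj-0001', 'obj-0002', 'obj-0003'],
    'isvariable': [True, False, True],
}

# Map each objectid to its label, in the '<objectid>':<label value> form
# that the labeldict kwarg expects.
labeldict = dict(zip(fakelcinfo['objectid'], fakelcinfo['isvariable']))

# The result would then be passed along these lines (not executed here):
# collect_nonperiodic_features(featuresdir, magcol, outfile,
#                              labeldict=labeldict, labeltype='binary')
print(labeldict)
```

Using dict(zip(...)) is equivalent to the dict comprehension shown in the parameter description.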
-
astrobase.varclass.rfclass.train_rf_classifier(collected_features, test_fraction=0.25, n_crossval_iterations=20, n_kfolds=5, crossval_scoring_metric='f1', classifier_to_pickle=None, nworkers=-1)
This gets the best RF classifier after running cross-validation.
- splits the training set into test/train samples
- does KFold stratified cross-validation using RandomizedSearchCV
- gets the RandomForestClassifier with the best performance after CV
- gets the confusion matrix for the test set
Runs on the output dict from functions that produce dicts similar to that produced by collect_nonperiodic_features above.
Parameters: - collected_features (dict or str) – This is either the dict produced by a collect_*_features function or the pickle produced by the same.
- test_fraction (float) – This sets the fraction of the input set that will be used as the test set after training.
- n_crossval_iterations (int) – This sets the number of iterations to use when running the cross-validation.
- n_kfolds (int) – This sets the number of K-folds to use on the data when doing a test-train split.
- crossval_scoring_metric (str) – This is a string that describes how the cross-validation score is calculated for each iteration. See the URL below for how to specify this parameter:
http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
By default, this is tuned for binary classification and uses the F1 scoring metric. Change the crossval_scoring_metric to another metric (probably ‘accuracy’) for multi-class classification, e.g. for periodic variable classification.
- classifier_to_pickle (str) – If this is a string naming a pickle file to write, the trained classifier will be written to that pickle, which can later be loaded and used to classify data.
- nworkers (int) – This is the number of parallel workers to use in the RandomForestClassifier. Set to -1 to use all CPUs on your machine.
Returns: A dict containing the trained classifier, the cross-validation results and score metrics, the input data set, and all input kwargs used.
Return type: dict
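The four steps listed for train_rf_classifier can be sketched directly with scikit-learn. This is an illustrative sketch under stated assumptions, not the module's actual implementation: the synthetic features and labels, the hyperparameter grid, and the random seeds are all invented here, and only the kwarg defaults test_fraction=0.25, n_kfolds=5 and crossval_scoring_metric='f1' are taken from the signature above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import (
    RandomizedSearchCV, StratifiedKFold, train_test_split
)

# Synthetic binary training set (assumption: stands in for the arrays in
# the dict produced by a collect_*_features function).
rng = np.random.RandomState(42)
features = rng.normal(size=(200, 4))
labels = (features[:, 0] + 0.5 * features[:, 1] > 0).astype(int)

# Step 1: split the training set into train/test samples.
train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.25, random_state=42
)

# Step 2: stratified K-fold cross-validation via RandomizedSearchCV.
# The hyperparameter choices below are hypothetical.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': [10, 20, 40],
        'max_depth': [3, 5, None],
    },
    n_iter=5,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='f1',
    random_state=42,
)
search.fit(train_x, train_y)

# Step 3: the RandomForestClassifier with the best CV performance.
best_clf = search.best_estimator_

# Step 4: confusion matrix for the held-out test set.
cmatrix = confusion_matrix(test_y, best_clf.predict(test_x), labels=[0, 1])
print(search.best_params_)
print(cmatrix)
```

For a multi-class problem, the scoring string would be swapped for something like 'accuracy', as the crossval_scoring_metric description suggests.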
-
astrobase.varclass.rfclass.apply_rf_classifier(classifier, varfeaturesdir, outpickle, maxobjects=None)
This applies an RF classifier trained using train_rf_classifier to varfeatures pickles in varfeaturesdir.
Parameters: - classifier (dict or str) – This is the output dict or pickle created by train_rf_classifier. This will contain a features_name key that will be used to collect the same features used to train the classifier from the varfeatures pickles in varfeaturesdir.
- varfeaturesdir (str) – The directory containing the varfeatures pickles for objects that will be classified by the trained classifier.
- outpickle (str) – This is a filename for the pickle that will be written containing the result dict from this function.
- maxobjects (int) – This sets the number of objects to process in varfeaturesdir.
Returns: The classification results after running the trained classifier are returned as a dict. This contains predicted labels and their prediction probabilities.
Return type: dict
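To illustrate what "predicted labels and their prediction probabilities" look like for a random forest, here is a small scikit-learn sketch. The synthetic data and the keys of the results dict are hypothetical and are not claimed to match the dict that apply_rf_classifier actually returns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Train a small classifier on synthetic data (assumption: stands in for
# a classifier already trained elsewhere).
rng = np.random.RandomState(0)
train_x = rng.normal(size=(100, 3))
train_y = (train_x[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(train_x, train_y)

# Feature rows for new, unlabeled objects.
new_x = rng.normal(size=(5, 3))

# One predicted label per object, and one probability per class per object.
predicted_labels = clf.predict(new_x)
predicted_probs = clf.predict_proba(new_x)

# Hypothetical result layout, for illustration only.
results = {
    'predicted_labels': predicted_labels,
    'predicted_probs': predicted_probs,
}
print(results)
```

Each row of predicted_probs sums to 1, with columns ordered as in clf.classes_.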
-
astrobase.varclass.rfclass.plot_training_results(classifier, classlabels, outfile)
This plots the training results from the classifier run on the training set.
- plots the confusion matrix
- plots the feature importances
- FIXME: plot the learning curves too, see: http://scikit-learn.org/stable/modules/learning_curve.html
Parameters: - classifier (dict or str) – This is the output dict or pickle created by train_rf_classifier containing the trained classifier.
- classlabels (list of str) – This contains all of the class labels for the current classification problem.
- outfile (str) – This is the filename where the plots will be written.
Returns: The path to the generated plot file.
Return type: str