/**
* One-hot encoding example function
* @param samples movie samples dataframe
*/
def oneHotEncoderExample(samples: DataFrame): Unit = {
  val samplesWithIdNumber = samples.withColumn("movieIdNumber", col("movieId").cast(sql.types.IntegerType))
  val oneHotEncoder = new OneHotEncoderEstimator()
    .setInputCols(Array("movieIdNumber"))
    .setOutputCols(Array("movieIdVector"))
    .setDropLast(false)
  val oneHotEncoderSamples = oneHotEncoder.fit(samplesWithIdNumber).transform(samplesWithIdNumber)
  oneHotEncoderSamples.printSchema()
  oneHotEncoderSamples.show(10)
}
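To make the encoding concrete, here is a plain-Scala sketch (independent of Spark) of what the encoder produces for one row: a sparse vector of size `vectorSize` with a single 1.0 at the movie-id position. The helper name `oneHot` is hypothetical, used only for illustration.

```scala
// Hypothetical helper mirroring the one-hot output for a single id:
// (vector size, active indices, values), i.e. the sparse-vector triple
// printed by Spark as (1001,[5],[1.0]).
def oneHot(idNumber: Int, vectorSize: Int): (Int, Array[Int], Array[Double]) =
  (vectorSize, Array(idNumber), Array(1.0))

val encoded = oneHot(5, 1001)
```

With `setDropLast(false)` every category keeps its own slot, which is why id 5 maps to `(1001,[5],[1.0])` in the output below.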
val array2vec: UserDefinedFunction = udf { (a: Seq[Int], length: Int) =>
  org.apache.spark.ml.linalg.Vectors.sparse(length, a.sortWith(_ < _).toArray, Array.fill[Double](a.length)(1.0))
}
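The core of `array2vec` is sorting the index array (sparse vectors require ascending indices) and pairing it with an array of ones. A plain-Scala sketch of that logic, without the Spark UDF wrapper (the name `toSparseComponents` is made up for this example):

```scala
// Mirrors array2vec: sort the active indices ascending and pair each
// with the value 1.0, yielding the components of a sparse vector.
def toSparseComponents(indexes: Seq[Int], length: Int): (Int, Array[Int], Array[Double]) = {
  val sorted = indexes.sortWith(_ < _).toArray
  (length, sorted, Array.fill[Double](sorted.length)(1.0))
}

val (size, idx, vals) = toSparseComponents(Seq(1, 5, 0, 3), 19)
// idx is Array(0, 1, 3, 5), matching the (19,[0,1,3,5],[1.0,...]) rows below
```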
/**
* Multi-hot encoding example function
* @param samples movie samples dataframe
*/
def multiHotEncoderExample(samples: DataFrame): Unit = {
  val samplesWithGenre = samples.select(
    col("movieId"),
    col("title"),
    explode(split(col("genres"), "\\|").cast("array<string>")).as("genre"))
  val genreIndexer = new StringIndexer().setInputCol("genre").setOutputCol("genreIndex")
  val stringIndexerModel: StringIndexerModel = genreIndexer.fit(samplesWithGenre)
  val genreIndexSamples = stringIndexerModel.transform(samplesWithGenre)
    .withColumn("genreIndexInt", col("genreIndex").cast(sql.types.IntegerType))
  /* println("genreIndexSamples:")
  genreIndexSamples.printSchema()
  genreIndexSamples.show(10, false)
  println("genreIndexSamples.agg:")
  genreIndexSamples.agg(max(col("genreIndexInt"))).show(10, false) */
  val indexSize = genreIndexSamples.agg(max(col("genreIndexInt"))).head().getAs[Int](0) + 1
  val processedSamples = genreIndexSamples
    .groupBy(col("movieId")).agg(collect_list("genreIndexInt").as("genreIndexes"))
    .withColumn("indexSize", typedLit(indexSize))
  val finalSample = processedSamples.withColumn("vector", array2vec(col("genreIndexes"), col("indexSize")))
  finalSample.printSchema()
  finalSample.show(10, false)
}
Notes:
- Usage of StringIndexer
- lit vs. typedLit
- collect_list
- Usage of agg
- Spark aggregation functions
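The two Spark pieces worth internalizing here are how StringIndexer assigns indices (by descending label frequency, which is why "Comedy" gets index 1 in the output below) and how `groupBy` + `collect_list` gathers each movie's genre indices into one array. A plain-Scala sketch of the indexing step, assuming frequency-descending order with alphabetical tie-breaking (the tie-break rule is an assumption for this example):

```scala
// StringIndexer-style indexing: more frequent labels get smaller indices.
val genres = Seq("Comedy", "Drama", "Comedy", "Romance", "Drama", "Comedy")
val indexMap: Map[String, Int] = genres
  .groupBy(identity)
  .toSeq
  .sortBy { case (genre, occurrences) => (-occurrences.length, genre) } // frequency desc, then name
  .map(_._1)
  .zipWithIndex
  .toMap
// "Comedy" appears 3 times -> index 0; "Drama" twice -> 1; "Romance" once -> 2
```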
Raw Movie Samples:
root
|-- movieId: string (nullable = true)
|-- title: string (nullable = true)
|-- genres: string (nullable = true)
+-------+--------------------+--------------------+
|movieId| title| genres|
+-------+--------------------+--------------------+
| 1| Toy Story (1995)|Adventure|Animati...|
| 2| Jumanji (1995)|Adventure|Childre...|
| 3|Grumpier Old Men ...| Comedy|Romance|
| 4|Waiting to Exhale...|Comedy|Drama|Romance|
| 5|Father of the Bri...| Comedy|
| 6| Heat (1995)|Action|Crime|Thri...|
| 7| Sabrina (1995)| Comedy|Romance|
| 8| Tom and Huck (1995)| Adventure|Children|
| 9| Sudden Death (1995)| Action|
| 10| GoldenEye (1995)|Action|Adventure|...|
+-------+--------------------+--------------------+
only showing top 10 rows
OneHotEncoder Example:
root
|-- movieId: string (nullable = true)
|-- title: string (nullable = true)
|-- genres: string (nullable = true)
|-- movieIdNumber: integer (nullable = true)
|-- movieIdVector: vector (nullable = true)
+-------+--------------------+--------------------+-------------+-----------------+
|movieId| title| genres|movieIdNumber| movieIdVector|
+-------+--------------------+--------------------+-------------+-----------------+
| 1| Toy Story (1995)|Adventure|Animati...| 1| (1001,[1],[1.0])|
| 2| Jumanji (1995)|Adventure|Childre...| 2| (1001,[2],[1.0])|
| 3|Grumpier Old Men ...| Comedy|Romance| 3| (1001,[3],[1.0])|
| 4|Waiting to Exhale...|Comedy|Drama|Romance| 4| (1001,[4],[1.0])|
| 5|Father of the Bri...| Comedy| 5| (1001,[5],[1.0])|
| 6| Heat (1995)|Action|Crime|Thri...| 6| (1001,[6],[1.0])|
| 7| Sabrina (1995)| Comedy|Romance| 7| (1001,[7],[1.0])|
| 8| Tom and Huck (1995)| Adventure|Children| 8| (1001,[8],[1.0])|
| 9| Sudden Death (1995)| Action| 9| (1001,[9],[1.0])|
| 10| GoldenEye (1995)|Action|Adventure|...| 10|(1001,[10],[1.0])|
+-------+--------------------+--------------------+-------------+-----------------+
MultiHotEncoder Example:
genreIndexSamples:
root
|-- movieId: string (nullable = true)
|-- title: string (nullable = true)
|-- genre: string (nullable = true)
|-- genreIndex: double (nullable = false)
|-- genreIndexInt: integer (nullable = true)
+-------+-----------------------+---------+----------+-------------+
|movieId|title |genre |genreIndex|genreIndexInt|
+-------+-----------------------+---------+----------+-------------+
|1 |Toy Story (1995) |Adventure|6.0 |6 |
|1 |Toy Story (1995) |Animation|15.0 |15 |
|1 |Toy Story (1995) |Children |7.0 |7 |
|1 |Toy Story (1995) |Comedy |1.0 |1 |
|1 |Toy Story (1995) |Fantasy |10.0 |10 |
|2 |Jumanji (1995) |Adventure|6.0 |6 |
|2 |Jumanji (1995) |Children |7.0 |7 |
|2 |Jumanji (1995) |Fantasy |10.0 |10 |
|3 |Grumpier Old Men (1995)|Comedy |1.0 |1 |
|3 |Grumpier Old Men (1995)|Romance |2.0 |2 |
+-------+-----------------------+---------+----------+-------------+
genreIndexSamples.agg:
+------------------+
|max(genreIndexInt)|
+------------------+
|18 |
+------------------+
finalSample:
root
|-- movieId: string (nullable = true)
|-- genreIndexes: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- indexSize: integer (nullable = false)
|-- vector: vector (nullable = true)
+-------+------------+---------+--------------------------------+
|movieId|genreIndexes|indexSize|vector |
+-------+------------+---------+--------------------------------+
|296 |[1, 5, 0, 3]|19 |(19,[0,1,3,5],[1.0,1.0,1.0,1.0])|
|467 |[1] |19 |(19,[1],[1.0]) |
|675 |[4, 0, 3] |19 |(19,[0,3,4],[1.0,1.0,1.0]) |
|691 |[1, 2] |19 |(19,[1,2],[1.0,1.0]) |
|829 |[1, 10, 14] |19 |(19,[1,10,14],[1.0,1.0,1.0]) |
|125 |[1] |19 |(19,[1],[1.0]) |
|451 |[0, 8, 2] |19 |(19,[0,2,8],[1.0,1.0,1.0]) |
|800 |[0, 8, 16] |19 |(19,[0,8,16],[1.0,1.0,1.0]) |
|853 |[0] |19 |(19,[0],[1.0]) |
|944 |[0] |19 |(19,[0],[1.0]) |
+-------+------------+---------+--------------------------------+
An alternative multi-hot approach (suitable when the number of tags is small): first build a word-to-index map with getWordsIndexMap, then map each tag set through it.
def getWordsIndexMap(rdd: RDD[Set[String]], ss: SparkSession): Broadcast[Map[String, Int]] = {
  val allWords = rdd.map(x => (1, x)).reduceByKey((x, y) => x ++ y).collect().head._2.toArray.sorted
  val wordsMapbt = ss.sparkContext.broadcast(allWords.zip(0 until allWords.length).toMap)
  wordsMapbt
}
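Stripped of the RDD and broadcast machinery, getWordsIndexMap just unions all tag sets into one vocabulary, sorts it, and assigns each word its position as an index. A Spark-free sketch of that core:

```scala
// Union all tag sets, sort the vocabulary, and index words by position.
val tagSets = Seq(Set("action", "comedy"), Set("comedy", "drama"))
val allWords = tagSets.reduce(_ ++ _).toArray.sorted   // Array("action", "comedy", "drama")
val wordsMap = allWords.zip(0 until allWords.length).toMap
```

Because the vocabulary is sorted before indexing, the map preserves alphabetical order, which transformVec relies on below.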
def transformVec(rdd: RDD[(String, Set[String], String)], ss: SparkSession, mp: Broadcast[Map[String, Int]]) = {
  import ss.sqlContext.implicits._
  val indexDF = rdd.map(x => x._1).distinct().zipWithUniqueId().toDF("id", "index")
  val outRDD = rdd.toDF("id", "keywords", "from")
    .join(indexDF, "id")
    .select("index", "keywords", "from")
    .rdd
    .map {
      case Row(index: Long, keywords: collection.mutable.WrappedArray[String], from: String) =>
        val len = mp.value.size
        val arr1 = keywords.toArray.sorted.map(x => mp.value(x))
        val arr2 = arr1.map(_ => 1.0)
        (index, Vectors.sparse(len, arr1, arr2), from)
    }
  (indexDF, outRDD)
}
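The per-row mapping in transformVec works because the word map was built from an alphabetically sorted vocabulary: sorting the keywords first therefore yields indices already in ascending order, as `Vectors.sparse` requires. A minimal Spark-free sketch of that row transformation:

```scala
// The vocabulary map assigns indices in alphabetical order (see getWordsIndexMap),
// so sorted keywords map to sorted indices, safe to feed into a sparse vector.
val mp = Map("action" -> 0, "comedy" -> 1, "drama" -> 2)
val keywords = Set("drama", "action")
val indices = keywords.toArray.sorted.map(mp)  // Array(0, 2), already ascending
val values = indices.map(_ => 1.0)
// (mp.size, indices, values) are the components of the multi-hot sparse vector
```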